For complex classification systems, data are gathered from various sources and potentially
have different representations. Thus, data may have large intra-class variations. In fact, modeling
each  data  class  with  a  single  model  might  lead  to  poor  generalization.  The  classification  error  can 
be  more  severe  for  temporal  data  where  each  sample  is  represented  by  a  sequence  of  observations. 
Thus,  there  is  a  need  for  building  a  classification  system  that  takes  into  account  the  variations  within 
each  class  in  the  data. 

This  dissertation  introduces  an  ensemble  learning  method  for  temporal  data  that  uses  a 
mixture  of  Hidden  Markov  Model  (HMM)  classifiers.  We  hypothesize  that  the  data  is  generated 
by  K  models,  each  of  which  reflects  a  particular  trend  in  the  data.  Model  identification  could  be 
achieved through clustering in the feature space or in the parameter space. However, this approach
is  inappropriate  in  the  context  of  sequential  data.  The  proposed  approach  is  based  on  clustering  in 
the  log-likelihood  space,  and  has  two  main  steps.  First,  one  HMM  is  fit  to  each  of  the  N  individual 
sequences.  For  each  fitted  model,  we  evaluate  the  log-likelihood  of  each  sequence.  This  will  result 
in  an  N-by-N  log-likelihood  distance  matrix  that  will  be  partitioned  into  K  groups  using  a  relational 
clustering  algorithm.  In  the  second  step,  we  pool  the  sequences  belonging  to  the  same  cluster  into 
K  groups.  Then,  we  learn  the  parameters  of  one  HMM  per  group.  We  propose  using  and  optimizing 
various  training  approaches  for  the  different  K  groups  depending  on  their  size  and  homogeneity.  In 
particular,  we  investigate  the  maximum  likelihood  (ML),  the  minimum  classification  error  (MCE) 
based  discriminative,  and  the  Variational  Bayesian  (VB)  training  approaches.  Finally,  to  test  a  new 
sequence,  its  likelihood  is  computed  in  all  the  models  and  a  final  confidence  value  is  assigned  by 
combining the outputs of the multiple models using a decision-level fusion method such as an artificial
neural network or a hierarchical mixture of experts.

Our approach was evaluated on two real-world applications: (1) identification of Cardio-Pulmonary
Resuscitation (CPR) scenes in video simulating medical crises; and (2) landmine detection
using Ground Penetrating Radar (GPR). Results on both applications show that the proposed
method  can  identify  meaningful  and  coherent  HMM  mixture  components  that  describe  different 
properties  of  the  data.  Each  HMM  mixture  component  models  a  group  of  data  that  share  common 
attributes.  The  results  indicate  that  the  proposed  method  outperforms  the  baseline  HMM  that  uses 
one  model  for  each  class  in  the  data. 
CHAPTER 1


INTRODUCTION 

Statistical learning problems in many fields involve sequential data. Sequential data can
be temporal, i.e., ordered with respect to time, or non-temporal, i.e., ordered with respect to an
index other than time, e.g., space. Temporal data examples include speech signals, stock market
prices, Ground Penetrating Radar (GPR) data, Cardio-Pulmonary Resuscitation (CPR) scenes, and
electroencephalographic (EEG) signals. Examples of non-temporal data include pixels in an image,
protein sequences, and moves in a chess game. Although there is no notion of time as such in
non-temporal data, such data must have an ordering and can thus be expressed in some form of temporal data.

One  of  the  key  tasks  in  sequential  data  mining  is  classification,  or  supervised  learning. 
The sequence classification task can be formulated as follows. Let $\{(x_i, y_i)\}_{i=1}^{N}$ be a set of N training
examples. Each example is a pair consisting of a sequence, $x_i$, and its label, $y_i$. A classifier h is a function that
maps sequences to labels. In a classification setting, the goal is to learn a classifier h from the
available N data samples in order to predict the label of a new example.

One  of  the  widely  used  classifiers  for  sequential  data  is  the  Hidden  Markov  Model  (HMM). 
HMMs  were  introduced  and  studied  in  the  late  1960s  and  early  1970s.  They  have  been  extensively 
and  successfully  applied  to  speech  processing  and  recognition  by  Rabiner  and  Huang  [4].  In  the 
1980s,  HMMs  became  the  method  of  choice  for  sequential  data  in  bioinformatics  applications  such 
as  protein  modeling  [5]  and  gene  prediction  [6]. 

An  HMM  is  a  statistical  model  of  a  doubly  stochastic  process  that  produces  a  sequence 
of  random  observation  vectors  at  discrete  times  according  to  an  underlying  Markov  chain.  The 
underlying  Markov  process,  not  directly  observable,  governs  the  transition  of  the  system  from  one 
hidden  state  to  another.  At  discrete  instants  of  time,  the  process  is  assumed  to  be  in  a  given  state 
and  an  observation  is  emitted  by  the  second  stochastic  process  (also  called  observation  emission 
probability)  corresponding  to  the  current  hidden  state.  The  underlying  Markov  chain  then  changes 
its  state  according  to  the  system’s  transition  matrix.  In  an  HMM,  the  observer  sees  only  the 
outputs of the observation emission probability and cannot observe the states of the underlying
Markov chain.


In 1989, Rabiner and Huang [7] identified three key problems of interest that need to be
addressed in order for HMMs to be useful in real-world applications. The canonical problems in
the HMM literature are the evaluation problem, the decoding problem, and the inference problem. One
of the reasons for the wide adoption and success of HMMs is the existence of efficient and accurate
algorithms that solve each of these problems. The forward-backward procedure, the Viterbi
algorithm [8], and the Baum-Welch reestimation algorithm [9] address these problems,
respectively.

The  third  problem,  namely  the  inference  problem,  is  by  far  the  most  challenging  in  HMMs.  It 
can be cast as follows: given a set of training data, how can we learn the model parameters that best fit or
describe the data? The subtlety of the problem arises from the definition of the criterion that measures
the model/data fit. Maximum likelihood [10], minimum classification error [11], and maximum
mutual  information  (MMI)  [12]  are  among  such  criteria  that  have  been  used  in  the  literature. 

In  certain  applications,  the  amount  of  data  to  be  analyzed  can  be  too  large  to  be  effectively 
handled  by  a  single  classifier.  Sequential/streaming  data  exhibit  inherently  more  variability  as  the 
characteristics  of  each  class  could  change  over  time.  Sometimes  data  are  collected  at  a  different  time 
or in a different environment. Training a single classifier with a vast amount of heterogeneous data
is  usually  not  practical.  Building  diverse,  uncorrelated,  and  accurate  classifiers  and  combining  their 
outputs  often  proves  to  be  a  more  efficient  approach.  Several  ensemble  learning  approaches  such  as 
boosting  [13],  bagging  [14],  stacked  generalization  [15],  random  subspace  method  [16],  and  simple 
algebraic  combiners  [17]  have  been  introduced  and  proven  to  be  superior  to  individual  classifiers  in 
a  number  of  benchmark  datasets. 

The most successful ensemble-based systems (e.g., AdaBoost [13] and Bagging [14]) are
typically meta-classifiers that rely on some form of resampling from the original training data and
building a base classifier for each sampling outcome. Although their results on benchmark datasets are
significantly better than those of base classifiers, these methods are computationally expensive and cannot
be applied efficiently to large datasets with high levels of noise and outliers. Other non-resampling-based
methods [16, 17] can be applied to large datasets but typically underperform.

In  this  work,  we  propose  an  HMM-based  ensemble  classification  method  that  is  based  on 
clustering  sequences  in  the  log-likelihood  space.  Our  approach  is  based  on  partitioning  the  data 
into  smaller  subsets,  training  different  HMM  classifiers  within  the  different  partitions  of  data,  and 
devising  an  adequate  combination  scheme  to  aggregate  the  output  of  the  individual  HMM  classifiers. 
We hypothesize that the data are generated by K models. These different models reflect the
fact that the different classes may have different characteristics accounting for intra-class variability
within the data. Model identification could be achieved through clustering in the parameter space
or in the feature space. However, this approach is inappropriate, as it is not trivial to define a
meaningful distance metric for model parameters or sequence comparison. Our proposed approach
is  based  on  clustering  in  the  log-likelihood  space,  and  has  two  main  steps.  First,  one  HMM  is  fit 
to  each  of  the  N  individual  sequences.  For  each  fitted  model,  we  evaluate  the  log-likelihood  of  each 
sequence.  This  will  result  in  an  N  x  N  log-likelihood  distance  matrix  that  will  be  partitioned  into 
K  groups  using  a  hierarchical  clustering  algorithm.  In  the  second  step,  we  learn  the  parameters  of 
one  HMM  per  group.  We  propose  using  and  optimizing  various  training  approaches  for  the  different 
K groups depending on their size and homogeneity. In particular, we will investigate the maximum
likelihood, the MCE-based discriminative, and the Variational Bayesian (VB) training approaches.
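
As a rough illustration of the first step, the sketch below starts from an already computed N x N log-likelihood matrix and groups the sequences with average-linkage hierarchical clustering. Everything specific in it is an assumption made for illustration: the function names, the toy matrix, the particular symmetrization used to turn log-likelihoods into distances, and the choice of average linkage. The distance and clustering algorithm actually used in this work are detailed in chapter 3 and may differ.

```python
def loglik_to_distance(L):
    """Turn a log-likelihood matrix into a symmetric distance matrix.

    L[i][j] is assumed to hold log P(O_j | lambda_i), where lambda_i is the
    HMM fitted to sequence i. The symmetrization below is an illustrative
    assumption, not necessarily the one used in the dissertation.
    """
    n = len(L)
    return [[0.0 if i == j else 0.5 * abs(L[i][i] + L[j][j] - L[i][j] - L[j][i])
             for j in range(n)] for i in range(n)]


def agglomerative(D, k):
    """Average-linkage hierarchical clustering of distance matrix D into k groups."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the two candidate clusters.
                d = sum(D[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy matrix (made up): sequences 0 and 1 explain each other well, as do 2 and 3.
L = [[-10, -11, -30, -32],
     [-12, -10, -31, -29],
     [-33, -30, -9, -10],
     [-31, -32, -11, -9]]
groups = agglomerative(loglik_to_distance(L), k=2)
print(groups)
```

In the second step, the sequences of each resulting group would be pooled and one HMM trained per group.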

The  remainder  of  this  dissertation  is  organized  as  follows.  In  chapter  2,  we  outline  the 
background  material  related  to  HMMs  and  to  ensemble  based  classification  methods.  In  chapter 
3,  we  detail  the  main  components  of  the  proposed  ensemble  HMM  classifier.  In  particular,  in 
the  first  part,  we  provide  the  motivations  behind  our  approach  using  an  illustrative  example  from 
the  landmine  detection  application.  In  the  second  part,  we  detail  the  components  of  our  classifier, 
namely  the  log-likelihood  similarity  computation,  the  pairwise-distance-based  hierarchical  clustering 
algorithm,  the  model-per-cluster  learning  component,  and  the  decision  level  fusion  approach.  In 
chapter  4,  we  evaluate  the  proposed  eHMM  approach  on  the  identification  of  CPR  scenes  in  video 
simulating  medical  crises.  In  chapter  5,  we  show  the  experimental  results  of  the  application  of  our 
method  to  landmine  detection  using  GPR  data.  Finally,  in  chapter  6,  we  conclude  and  provide 
potential  future  work. 
CHAPTER 2


RELATED  WORK 


2.1  Hidden  Markov  Models 

HMMs  are  widely  used  in  a  variety  of  fields  for  modeling  time  series  data.  Application 
domains  of  HMM  include  speech  recognition  [7],  protein  sequence  modeling  [18,  19],  and  finance 
[20, 21]. The core theory was first introduced by Baum et al. [9] with a primary focus on elementary
speech  processing.  The  popularity  of  HMMs  soared  subsequently  with  the  works  of  Rabiner  and 
Huang  [7,  4].  Nowadays,  HMMs  are  the  method  of  choice  for  temporal  data  modeling. 

2.1.1  HMM  architecture 

An HMM is a model of a doubly stochastic process that produces a sequence of random
observation vectors at discrete times according to an underlying first-order Markov chain. At each
observation time t, the Markov chain may be in one of the N states, denoted $s_1, \dots, s_N$. Given that
the chain is in a certain state $s_t$ at time t, it moves to another state $s_{t+1}$ according to a state-to-state transition
probability, denoted A. At any given state $s_t$, the model emits an observation $o_t$ according to a
state emission probability, denoted B. Finally, the probability that the chain starts at a particular
state $s_i$ is governed by the initial probability distribution, denoted $\pi$.

Let T be the length of the observed sequence, $O = [o_1, \dots, o_T]$ be the observed sequence,
and $Q = [q_1, \dots, q_T]$ be the hidden state sequence. The compact notation

$$\lambda = (A, B, \pi) \tag{1}$$

is generally used to indicate the complete parameter set of the HMM. In (1), $A = [a_{ij}]$
is the state transition probability matrix, where $a_{ij} = \Pr(q_t = j \mid q_{t-1} = i)$ for $i, j = 1, \dots, N$;
$B = \{b_i(o_t)\}$ for $i = 1, \dots, N$ and $t = 1, \dots, T$, where $b_i(o_t) = \Pr(o_t \mid q_t = i)$ is the emission probability
distribution in state i; and $\pi = \{\pi_i\}$, with $\pi_i = \Pr(q_1 = i)$, are the initial state probabilities.

An  HMM  is  called  continuous  if  the  emission  probability  density  functions  are  continuous 
and  discrete  if  the  emission  probability  density  functions  are  discrete.  In  the  case  of  a  discrete  HMM, 
the observation vectors are commonly quantized into a finite set of M symbols $\{v_1, \dots, v_m, \dots, v_M\}$
called  the  codebook.  Each  state  is  represented  by  a  discrete  probability  density  function  and  each 
symbol  has  a  probability  of  occurring  given  that  the  model  is  in  that  state.  Thus,  for  the  discrete 
case,  B  becomes  a  simple  set  of  fixed  probabilities  for  each  state. 
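
To make the generative view of a discrete HMM concrete, the following minimal sketch draws a hidden state path and an observation sequence from a parameter set $(\pi, A, B)$ stored as plain Python lists. The function name `sample_hmm` and the two-state, three-symbol toy parameters are assumptions introduced only for this illustration.

```python
import random


def sample_hmm(pi, A, B, T, rng=None):
    """Draw a hidden state path and an observation sequence of length T.

    pi[i]   : probability of starting in state i
    A[i][j] : probability of moving from state i to state j
    B[i][m] : probability of emitting codebook symbol m while in state i
    """
    rng = rng or random.Random()
    n_states, n_symbols = len(pi), len(B[0])
    states, observations = [], []
    state = rng.choices(range(n_states), weights=pi)[0]  # initial state from pi
    for _ in range(T):
        states.append(state)
        # Emit a symbol according to the current state's emission distribution.
        observations.append(rng.choices(range(n_symbols), weights=B[state])[0])
        # Move to the next state according to the transition matrix.
        state = rng.choices(range(n_states), weights=A[state])[0]
    return states, observations


# Toy parameters (made up for illustration).
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]
states, obs = sample_hmm(pi, A, B, T=10, rng=random.Random(0))
print(states, obs)
```

Only `obs` would be visible to an observer; `states` is the hidden path that the decoding problem tries to recover.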

2.1.2  HMM  topologies 

Depending  on  the  state  transition  matrix,  an  HMM  can  be  classified  into  one  of  the  following 
types  [22]: 

1. Ergodic model: if the associated Markov chain is ergodic. That is, the system can move from
one state to any other state. In other words, all the transition matrix entries $a_{ij}$, $1 \le i, j \le N$,
are strictly positive.

2. Left-right model: if the Markov chain starts at a particular initial state, traverses a number of
intermediate states, and finally terminates in a final state (sometimes called the absorbing state).
While traversing intermediate states, the chain may not go backwards toward the initial state.
Mathematically, the transition matrix, A, should be such that $a_{ij} = 0, \ \forall j < i$.
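
The two topologies can be told apart directly from the transition matrix. The sketch below is a minimal check, taking "ergodic" in the fully connected sense in which every entry is strictly positive; the function names and example matrices are assumptions made for illustration.

```python
def is_fully_connected(A):
    """True if every transition probability a_ij is strictly positive."""
    return all(a > 0.0 for row in A for a in row)


def is_left_right(A):
    """True if a_ij = 0 for all j < i, i.e. the chain never moves backwards."""
    return all(A[i][j] == 0.0 for i in range(len(A)) for j in range(i))


# Toy three-state examples (made up for illustration).
ergodic = [[0.6, 0.3, 0.1],
           [0.2, 0.5, 0.3],
           [0.3, 0.3, 0.4]]
left_right = [[0.7, 0.3, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]]  # the last state is absorbing

print(is_fully_connected(ergodic), is_left_right(ergodic))
print(is_fully_connected(left_right), is_left_right(left_right))
```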

2.1.3  HMM  assumptions 

The  theoretical  framework  of  HMM  is  based  on  three  assumptions  that  make  the  models 
useful  in  real-world  applications  and  allow  for  tractable  computations  within  the  framework. 

1. Markov property assumption. The standard practice is to make a first-order Markov assumption:
the transition to a new state $q_{t+1} = s_j$, given the current and previous states, depends
only on the current state $q_t = s_i$ through the transition matrix $A = [a_{ij}]$. Mathematically, this
translates to: $P(q_{t+1} \mid q_t, q_{t-1}, \dots, q_1, \lambda) = P(q_{t+1} \mid q_t, \lambda)$.

2. Independence assumption. Here, it is assumed that the current observation is statistically
independent of all the previous observations given the current state. Formally,

$$\Pr(O \mid Q, \lambda) = \Pr(o_1, o_2, \dots, o_T \mid q_1, q_2, \dots, q_T, \lambda) = \prod_{t=1}^{T} \Pr(o_t \mid q_t, \lambda). \tag{2}$$

3. Stationarity assumption. This assumption states that the model parameters are time-independent.
In particular, for the transition matrix:

$$\Pr(q_{t_1+1} = s_j \mid q_{t_1} = s_i, \lambda) = \Pr(q_{t_2+1} = s_j \mid q_{t_2} = s_i, \lambda) = a_{ij}, \tag{3}$$

and for the state emission pdfs,

$$P(o_{t_1} \mid q_{t_1} = s_i, \lambda) = P(o_{t_2} \mid q_{t_2} = s_i, \lambda) = b_i(\cdot), \tag{4}$$

for any $t_1, t_2$ in $[1..T]$.

2.1.4  Main  problems  in  HMMs 

Given the form of the HMM defined in (1), Rabiner [7] defines three key problems of interest
that must be solved in order for the model to be useful in real-world applications:

1. The evaluation problem: given the observation sequence O and the model $\lambda$, how to efficiently
evaluate the probability of O being produced by the source model $\lambda$, i.e., $\Pr(O \mid \lambda)$?

2. The decoding problem: given the observation sequence O and the model $\lambda$, how to find the
most likely hidden state sequence $\mathbf{q}$ that led to the observed sequence O?

3. The learning problem: also called the training problem, it consists of learning the optimal model
parameters given a set of training data. This problem is difficult because there are several
levels of estimation required in an HMM. First, the number of states must be estimated. This
is usually inferred from the physical characteristics of the problem at hand or performed using
a model selection technique. Then, the model parameters $\lambda = (\pi, A, B)$ need to be estimated.
In the discrete HMM, first the codebook is determined, usually using clustering algorithms
such as K-means [23] or other vector quantization algorithms. In the continuous HMM,
and for the case of Gaussian mixture density functions, the mixture component parameters,
$\mu_{ij}$ and $\Sigma_{ij}$, are first initialized (usually by clustering the training data). Then, for both cases,
the parameters $(\pi, A, B)$ are estimated iteratively.

2.1.5  The  HMM  classifier 

In the previous sections, we introduced the architecture, topologies, assumptions, and the
three main problems in hidden Markov modeling. The reason for the wide adoption and development
of HMMs for sequential data is the presence of efficient algorithms that address and solve the
aforementioned problems. HMMs have particularly proven useful in pattern recognition and
classification applications. In a multiclass classification setting, each class c, $c \in \{1, \dots, C\}$, is modeled by
an HMM $\lambda_c$. An observation O is classified according to Bayes decision theory [24], which assigns O
to the class whose model has the maximum posterior probability. That is, $\hat{c} = \arg\max_c P(\lambda_c \mid O)$.
Bayes rule states that

$$P(\lambda_c \mid O) = \frac{P(O \mid \lambda_c)\, P(\lambda_c)}{P(O)} \tag{5}$$

where $P(O \mid \lambda_c)$ is the likelihood of the observed data O given the model for class c, $P(\lambda_c)$ is the prior
probability of model $\lambda_c$, and $P(O)$ is the evidence, or the probability of occurrence of observation O.

Assuming all the models are equally probable, and given that the denominator of equation
(5) does not depend on c, maximizing the posterior probability $P(\lambda_c \mid O)$ is equivalent to maximizing
the likelihood $P(O \mid \lambda_c)$: $\hat{c} = \arg\max_c P(O \mid \lambda_c)$. As discussed in the next section, $P(O \mid \lambda_c)$ can be
computed efficiently using the forward-backward procedure.
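
The decision rule above can be sketched in a few lines once the per-class log-likelihoods are available. The helper below is a hypothetical illustration (the name `classify` and the numbers are made up): it works in the log domain, where the Bayes rule becomes a sum of log-likelihood and log-prior, and it assumes strictly positive priors.

```python
import math


def classify(log_likelihoods, priors=None):
    """Return the class index c maximizing log P(O | lambda_c) + log P(lambda_c).

    With priors=None all models are taken as equally probable, so the rule
    reduces to picking the class with the largest likelihood.
    """
    n = len(log_likelihoods)
    if priors is None:
        priors = [1.0 / n] * n
    scores = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    return max(range(n), key=scores.__getitem__)


# Made-up log-likelihoods of one sequence under three class models.
log_likelihoods = [-42.7, -39.1, -40.3]
equal_prior_choice = classify(log_likelihoods)
skewed_prior_choice = classify(log_likelihoods, priors=[0.98, 0.01, 0.01])
print(equal_prior_choice, skewed_prior_choice)
```

The second call shows that a sufficiently skewed prior can change the decision, which is why the equal-prior assumption matters for the likelihood-only rule.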

A less trivial task in hidden Markov modeling is the inference, or learning, of the model
parameters $\lambda_c$ that best describe the characteristics of class c. Several approaches to HMM model
learning will be discussed later in this chapter.

2.1.6 Solution to the evaluation problem: the forward-backward procedure

Given an observation sequence $O = [o_1, \dots, o_T]$, the most straightforward way to compute
$P(O \mid \lambda)$ is by enumerating every possible state sequence of length T. There are $N^T$ such state sequences.
Let $\mathbf{q} = [q_1, q_2, \dots, q_T]$ denote one such state sequence, where $q_1$ is the initial state. Assuming statistical
independence of observations $o_t$, the probability of the observation sequence O given the state
sequence $\mathbf{q}$ and the model $\lambda$ is:

$$P(O \mid \mathbf{q}, \lambda) = \prod_{t=1}^{T} P(o_t \mid q_t, \lambda) = b_{q_1}(o_1)\, b_{q_2}(o_2) \cdots b_{q_T}(o_T) \tag{6}$$

The probability of the fixed state sequence $\mathbf{q}$ in the model $\lambda$ is:

$$P(\mathbf{q} \mid \lambda) = \pi_{q_1} a_{q_1 q_2} a_{q_2 q_3} \cdots a_{q_{T-1} q_T} \tag{7}$$

The joint probability of O and $\mathbf{q}$ is the product of the above two terms, that is:

$$P(O, \mathbf{q} \mid \lambda) = P(O \mid \mathbf{q}, \lambda)\, P(\mathbf{q} \mid \lambda) \tag{8}$$

To obtain the probability of O given the model $\lambda$, we sum the joint probability in (8) over all possible
state sequences. Thus, we have

$$P(O \mid \lambda) = \sum_{\mathbf{q}} P(O \mid \mathbf{q}, \lambda)\, P(\mathbf{q} \mid \lambda) = \sum_{q_1, q_2, \dots, q_T} \pi_{q_1} b_{q_1}(o_1)\, a_{q_1 q_2} b_{q_2}(o_2) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T) \tag{9}$$
The evaluation problem of HMMs using exhaustive enumeration is intractable, as the summation
in (9) is on the order of $N^T$ computations. This evaluation is infeasible even for small values
of N and T. For instance, for N = 4 and T = 120, there are on the order of $10^{72}$ computations (on the
order of the number of electrons in the universe! [25]). Fortunately, the forward-backward procedure
tackles this problem and reduces the complexity to $N^2 T$ by exploiting the HMM Markov property
and its interpretation as a graphical model.
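
For very short sequences, the sum over all state sequences can still be evaluated literally, which is useful as a reference when checking faster implementations. The sketch below does exactly that for a discrete HMM stored as plain lists; the function name and toy parameters are assumptions made for illustration, and the loop visits all $N^T$ paths, so it is only usable for tiny T.

```python
from itertools import product


def likelihood_by_enumeration(pi, A, B, obs):
    """Compute P(O | lambda) by summing the joint probability over every state path."""
    n_states = len(pi)
    total = 0.0
    for path in product(range(n_states), repeat=len(obs)):
        # Joint probability of this path and the observations.
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total


# Toy two-state, three-symbol model (made up for illustration).
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]
print(likelihood_by_enumeration(pi, A, B, [0, 1, 2]))

# Number of state paths for N = 4 states and T = 120 observations.
print(4 ** 120)
```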

2.1.6.1 Forward procedure

Let the forward variable $\alpha_t(i)$ be the joint probability of the partial observation sequence up to
time t and of being in state i at time t, given the model:

$$\alpha_t(i) = P(o_1, o_2, \dots, o_t, q_t = i \mid \lambda). \tag{10}$$

A closed form for $\alpha_t(i)$ can be derived by induction on t [7]:

1. Initialization:

$$\alpha_1(i) = \pi_i b_i(o_1), \quad 1 \le i \le N. \tag{11}$$

2. By induction, for $1 \le t \le T - 1$ and $1 \le j \le N$, we have:

$$\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_j(o_{t+1}) \tag{12}$$

3. At the end, we have:

$$\Pr(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i) \tag{13}$$

In step 2, the forward variable $\alpha_{t+1}(j)$ is computed for each state j from the forward variables at
time t. Given a fixed t, state j can be reached independently from any of the N states i with
probability $a_{ij}$. The summation in step 2 reflects this. Once in state j, the system emits observation
symbol $o_{t+1}$ according to the state-j probability distribution $b_j(\cdot)$. Thus, using dynamic programming,
the solution to problem 1 is reduced to the order of $N^2 T$ computations.
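
The three steps of the forward procedure translate almost line by line into code. The following is a minimal sketch for a discrete HMM, with hypothetical names and made-up toy parameters; it works with raw probabilities, so for long sequences a practical implementation would need scaling or log-domain arithmetic to avoid underflow.

```python
def forward(pi, A, B, obs):
    """Return (alpha, likelihood), with alpha[t][i] = P(o_1..o_{t+1}, state i | lambda).

    Indices are 0-based, so alpha[0] holds the initialization step.
    """
    n = len(pi)
    # Step 1: initialization.
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    # Step 2: induction over time.
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
            for j in range(n)
        ])
    # Step 3: termination, summing the last forward variables.
    return alpha, sum(alpha[-1])


# Toy two-state, three-symbol model (made up for illustration).
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]
alpha, likelihood = forward(pi, A, B, [0, 1, 2])
print(likelihood)
```

Each time step costs $N^2$ multiply-adds, which gives the $N^2 T$ total stated above.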

2.1.6.2 Backward procedure

In a similar way, we let the backward variable $\beta_t(i)$ be the probability of the partial observation
sequence from time t + 1 to T, given the state i at time t and the model $\lambda$:

$$\beta_t(i) = \Pr(o_{t+1}, o_{t+2}, \dots, o_T \mid q_t = i, \lambda). \tag{14}$$

Note that here $q_t = i$ has already been given (this was not the case for the forward variable). This
distinction has been made to be able to combine the forward and the backward variables coherently
to produce useful results that will help solve problem 2 and problem 3 of HMMs, as we shall see
soon. Following the same induction argument as for $\alpha_t(i)$, $\beta_t(i)$ can be easily calculated using the
following steps [7]:

1. Initialization:

$$\beta_T(i) = 1, \quad 1 \le i \le N. \tag{15}$$

2. By induction, for $t = T - 1, T - 2, \dots, 1$ and $1 \le i \le N$:

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j) \tag{16}$$

3. After termination,

$$\Pr(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i) \tag{17}$$

The computation of $\Pr(O \mid \lambda)$ using $\beta_t(i)$ also involves on the order of $N^2 T$ calculations. Hence, the
forward and the backward methods are equally efficient for the computation of $\Pr(O \mid \lambda)$.
This solves the evaluation problem in HMMs.
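
A matching sketch of the backward recursion is given below, again for a discrete HMM stored as plain lists and with hypothetical names and made-up toy parameters. On the same toy model, the termination step should return the same likelihood as the forward recursion, which makes a convenient consistency check.

```python
def backward(pi, A, B, obs):
    """Return (beta, likelihood), with beta[t][i] = P(o_{t+2}..o_T | state i at step t).

    Indices are 0-based, so beta[T-1] holds the initialization step.
    """
    n, T = len(pi), len(obs)
    # Step 1: initialization, beta_T(i) = 1 for every state.
    beta = [[1.0] * n for _ in range(T)]
    # Step 2: induction, moving backwards in time.
    for t in range(T - 2, -1, -1):
        beta[t] = [
            sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
            for i in range(n)
        ]
    # Step 3: termination, combining beta_1 with the initial and emission terms.
    likelihood = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(n))
    return beta, likelihood


# Toy two-state, three-symbol model (made up for illustration).
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]
beta, likelihood = backward(pi, A, B, [0, 1, 2])
print(likelihood)
```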


2.1.7 Solution to the decoding problem: the Viterbi algorithm

The second problem in HMMs consists of finding the optimal state sequence associated with
a given observation sequence. Unlike the evaluation problem, for which an exact solution could be
derived, there are several possible ways of solving the decoding problem. The difficulty lies with
the definition of an optimality criterion. One possible choice for the optimality criterion is to choose
the states that are individually most likely at each time t. The criterion adopted here is instead to
find the single best state sequence: we have to find a state sequence $\mathbf{q} = [q_1, q_2, \dots, q_T]$ such that
the joint probability of this state sequence and the observation sequence $O = [o_1, o_2, \dots, o_T]$ is
greater than that for any other state sequence. The problem is then to find $\mathbf{q}^*$ that maximizes
$\Pr(O, \mathbf{q} \mid \lambda)$. This can be achieved using the Viterbi algorithm [8]. First, we define the quantity

$$\delta_t(i) = \max_{q_1, q_2, \dots, q_{t-1}} P(q_1, q_2, \dots, q_{t-1}, q_t = i, o_1, o_2, \dots, o_t \mid \lambda) \tag{18}$$

as the highest probability along a single path that generated the partial sequence $[o_1 o_2 \cdots o_t]$ and
ended in state i. By induction, we have:

$$\delta_{t+1}(j) = \left[ \max_{i} \delta_t(i)\, a_{ij} \right] b_j(o_{t+1}) \tag{19}$$

The Viterbi algorithm can be summarized by the following four steps:

1. Initialization: for $1 \le i \le N$

$$\delta_1(i) = \pi_i b_i(o_1) \tag{20}$$

$$\psi_1(i) = 0 \tag{21}$$

2. Recursive computation: for $2 \le t \le T$ and $1 \le j \le N$

$$\delta_t(j) = \max_{1 \le i \le N} \left[ \delta_{t-1}(i)\, a_{ij} \right] b_j(o_t) \tag{22}$$

$$\psi_t(j) = \arg\max_{1 \le i \le N} \left[ \delta_{t-1}(i)\, a_{ij} \right] \tag{23}$$

3. Termination:

$$P^* = \max_{1 \le i \le N} \left[ \delta_T(i) \right] \tag{24}$$

$$q_T^* = \arg\max_{1 \le i \le N} \left[ \delta_T(i) \right] \tag{25}$$

4. Tracing back the optimal state sequence, for $t = T - 1, T - 2, \dots, 1$

$$q_t^* = \psi_{t+1}(q_{t+1}^*) \tag{26}$$

Hence $P^*$ gives the required state-optimized probability, and $\mathbf{q}^* = [q_1^*, q_2^*, \dots, q_T^*]$ is the optimal
state sequence. Computationally, the Viterbi algorithm is similar to the forward-backward procedure
except for the comparisons needed to find the maximum values of $\delta$ and for the backtracking step
4. Therefore, its complexity is also on the order of $N^2 T$. It is possible to implement the Viterbi
algorithm by taking the logarithm of $\delta$, hence replacing the $N^2 T$ multiplications with
additions.
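
The log-domain variant mentioned above can be sketched as follows. The code follows the four steps of the algorithm for a discrete HMM stored as plain lists; the function names and the toy parameters are assumptions made for illustration, and zero probabilities are mapped to minus infinity so that forbidden transitions are never selected.

```python
import math


def _log(p):
    """Logarithm that maps a zero probability to minus infinity."""
    return math.log(p) if p > 0.0 else float("-inf")


def viterbi(pi, A, B, obs):
    """Return (best_path, log_prob) of the single most likely state sequence."""
    n = len(pi)
    # Step 1: initialization (log of delta_1).
    delta = [_log(pi[i]) + _log(B[i][obs[0]]) for i in range(n)]
    psi = []  # psi[k][j]: best predecessor of state j at time index k + 1
    # Step 2: recursion, with sums of logs in place of products.
    for t in range(1, len(obs)):
        back, new_delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] + _log(A[i][j]))
            back.append(best_i)
            new_delta.append(delta[best_i] + _log(A[best_i][j]) + _log(B[j][obs[t]]))
        psi.append(back)
        delta = new_delta
    # Step 3: termination.
    last = max(range(n), key=delta.__getitem__)
    # Step 4: backtracking from the final state.
    path = [last]
    for back in reversed(psi):
        path.append(back[path[-1]])
    path.reverse()
    return path, delta[last]


# Toy two-state, three-symbol model (made up for illustration).
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]
path, log_prob = viterbi(pi, A, B, [0, 1, 2])
print(path, math.exp(log_prob))
```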

In  summary,  problems  1  and  2  are  efficiently  solved  using  the  forward-backward  procedure 
and  dynamic  programming.  However,  a  less  trivial  task  in  hidden  Markov  modeling  is  the  inference, 
or learning, of the model parameters $\lambda$ that best describe the characteristics of the training data. In
the  following,  we  describe  three  different  approaches  to  solve  this  problem. 




2.1.8  Solutions  to  the  inference  problem 


The parameter estimation problem is the most challenging problem of HMMs. The goal is
to adjust the model parameters $(\pi, A, B)$ based on the observed sequence(s) O. In fact, there is no
analytical solution for the model parameter set that maximizes the probability of the observation
sequence. There are several possible ways of defining an optimality criterion. The Maximum
Likelihood (ML) [10] criterion is generally used to find the set of parameter estimators that maximize
the likelihood of the data in the HMM. However, for the general HMM setting, there is no
analytical solution to the Maximum Likelihood Estimator (MLE). Generally, iterative methods, such
as the Expectation Maximization (EM) algorithm [26], are used to approximate the MLE.

Another widely used optimality criterion is the minimum classification error (MCE) [11].
In the MCE training framework, the goal is to find the optimal model parameters that minimize the
classification error on the training data.

Both MLE and MCE are iterative procedures and could lead to sub-optimal solutions (local
optima). In addition, MLE is a point estimate of the true model parameters that maximize the
posterior probability. This method requires a large training data set to get good estimates. In the case
of insufficient training data, Bayesian methods are preferred. In Bayesian learning, we first set priors
on the parameters to be estimated. Then, we use the available observations (evidence) to compute
the posterior distribution over the parameters. Predictions for test data are made by integrating out
the parameters. Unfortunately, exact Bayesian inference for HMMs is not feasible, as it involves
computing intractable integrals. Most existing methods, such as Markov Chain Monte Carlo (MCMC) [27],
Gibbs sampling [28], and Laplace approximations [25], either require vast computational resources to
get accurate results or approximate all the posteriors via a normal distribution centered at the ML
estimate. The Variational Bayesian (VB) [29] method tries to approximate the posteriors by a class
of simpler functions. Typically, the approximating functions should be separable/factorizable in
the model parameters.

2. 1.8.1  Maximum  Likelihood  training:  The  Baum- Welch  algorithm 

In this section, we outline the procedure for the re-estimation of the HMM parameters using
the Expectation Maximization algorithm [26]. In the context of HMMs, the EM algorithm is also
known as the Baum-Welch (BW) algorithm [9], or the forward-backward re-estimation procedure.
The goal is to approximate the maximum likelihood estimator of the HMM parameters, since a
closed-form expression for these parameters cannot be derived.

First, we define $\xi_t(i,j)$ as the probability of being in state $i$ at time $t$ and state $j$ at
time $t+1$, given the model and the observation sequence, i.e.

$$\xi_t(i,j) = P(q_t = i,\ q_{t+1} = j \mid O, \lambda). \tag{27}$$

From the definitions of the forward and backward variables in equations (10) and (14), we can rewrite
$\xi_t(i,j)$ as

$$\xi_t(i,j) = \frac{P(q_t = i,\ q_{t+1} = j,\ O \mid \lambda)}{P(O \mid \lambda)}
= \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}. \tag{28}$$

Let $\gamma_t(i)$ denote the probability of being in state $i$ at time $t$, given the observation
sequence $O$ and the model $\lambda$, i.e.,

$$\gamma_t(i) = P(q_t = i \mid O, \lambda). \tag{29}$$

It can be shown that $\gamma_t(i)$ is related to $\xi_t(i,j)$ by summing over all states $j$, i.e.,

$$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j). \tag{30}$$

Now, if we sum $\gamma_t(i)$ over the time index $t$, we get a quantity that could be interpreted as
the expected number of times that state $i$ is visited. Similarly, summing $\xi_t(i,j)$ over $t$
yields the expected number of transitions from state $i$ to state $j$. $\gamma_1(i)$ is interpreted
as the expected frequency (number of times) we started at state $i$. Finally, summing $\gamma_t(i)$
over time steps at which a particular symbol $v_m$ was observed can be interpreted as the expected
number of times visiting state $i$ and observing $v_m$. Using the above interpretations, we derive
the following equations for re-estimating the parameters of an HMM:

$$\bar{\pi}_i = \gamma_1(i), \tag{31}$$

$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \tag{32}$$

and

$$\bar{b}_j(m) = \frac{\sum_{t=1,\ o_t = v_m}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}. \tag{33}$$


Algorithm 1 Baum-Welch training algorithm

Require: Training data $[O^{(1)}, \cdots, O^{(R)}]$, $O^{(r)} = [o_1, \cdots, o_T]$. Fix the variables N and M.
Ensure:
1: Cluster training data into N clusters, and let $s_i$, the center of each cluster, be the
   representative of state i
2: Cluster training data into M clusters, and let $v_m$, the center of each cluster, be the mth
   element of the codebook
3: Set the initial parameters using one of the methods described in section (2.1.9)
4: while stopping criteria not satisfied do
5:    Compute the probability of the hidden states given the data using the forward-backward
      procedure;
6:    Deduce the transition and emission statistics;
7:    Update A using (32);
8:    Update B using (33);
9: end while


A more detailed derivation of the update equations for the BW algorithm can be found in [30].
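
The re-estimation formulas (28) to (33) can be sketched in a few lines of NumPy. The block below is only an illustrative sketch, not the implementation used in this work: it assumes a discrete HMM, a single observation sequence of integer symbol indices, and per-step scaling of the forward and backward variables to avoid underflow. The function names `forward_backward` and `baum_welch_step` are hypothetical.

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """Scaled forward-backward pass for one discrete observation sequence."""
    T, N = len(obs), A.shape[0]
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    scale = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / scale[t + 1]
    # log P(O | lambda) is the sum of the log scaling factors
    return alpha, beta, np.log(scale).sum()

def baum_welch_step(A, B, pi, obs):
    """One re-estimation step; returns updated (A, B, pi) and the
    log-likelihood of obs under the parameters passed in."""
    obs = np.asarray(obs)
    alpha, beta, loglik = forward_backward(A, B, pi, obs)
    # xi[t, i, j] proportional to alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j), eq. (28)
    xi = alpha[:-1, :, None] * A[None] * (B[:, obs[1:]].T * beta[1:])[:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)
    # gamma[t, i] proportional to alpha_t(i) beta_t(i), eq. (29)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    new_pi = gamma[0]                                            # eq. (31)
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]     # eq. (32)
    new_B = np.zeros_like(B)
    for m in range(B.shape[1]):
        new_B[:, m] = gamma[obs == m].sum(axis=0)                # numerator of eq. (33)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi, loglik
```

Calling `baum_welch_step` repeatedly until the returned log-likelihood stops improving corresponds to the while loop of Algorithm 1. Handling several training sequences would require accumulating the numerators and denominators of (32) and (33) over all sequences before dividing.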


2.1.8.2 Discriminative training: Minimum Classification Error criterion

In this section, we derive the necessary MCE update equations for the HMM parameters for
a C-class problem. In discriminative MCE training, we seek to minimize the classification error.
Each random sequence $O$ is to be classified into one of the $C$ classes. We denote these classes by
$C_c$, $c = 1, 2, \ldots, C$. Each class $c$ is modeled by an HMM $\lambda_c$. Let $O$ be an
observation sequence and let $g_c(O)$ be a discriminant function associated with classifier $c$. In
the case of an HMM classifier, the discriminant function $g_c$ is proportional to the posterior
probability $\Pr(O \mid \lambda_c)$. Thus, the sequence $O$ is assigned to the class $c$ in which it
has maximum posterior probability.

The classifier $\Gamma(O)$ defines a mapping from the sample space $O \in \mathcal{O}$ to the
discrete categorical set $C_c$, $c = 1, 2, \ldots, C$. That is,

$$\Gamma(O) = \arg\max_{c}\ g_c(O). \tag{34}$$

The misclassification measure of the sequence $O$ is defined as:

$$d_c(O) = -g_c(O, \Lambda) + \log\left[\frac{1}{C-1} \sum_{j,\, j \neq c} \exp[\eta\, g_j(O, \Lambda)]\right]^{1/\eta} \tag{35}$$




where $\eta$ is a positive number and $\Lambda$ denotes the parameters of all the class models.
$d_c(O) > 0$ implies misclassification and $d_c(O) < 0$ means correct decision. When $\eta$
approaches $\infty$, the second term of the right hand side of equation (35) becomes
$\max_{j,\, j \neq c} g_j(O, \Lambda)$. The misclassification measure is then cast into a smoothed
zero-one function, referred to as the loss function, defined as:

$$l_c(O; \Lambda) = l(d_c(O)), \tag{36}$$

where $l$ is a sigmoid function, one example of which is:

$$l(d) = \frac{1}{1 + \exp(-\gamma d + \theta)}. \tag{37}$$


In (37), the threshold parameter $\theta$ is normally set to zero, and the smoothing parameter
$\gamma$ is set to a number larger than one. Correct classification corresponds to loss values in
$[0, \frac{1}{2})$, and misclassification corresponds to loss values in $(\frac{1}{2}, 1]$.
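
As a small numerical illustration of (35) to (37), the sketch below computes the misclassification measure and the sigmoid loss from a vector of discriminant values. It assumes that `g[j]` holds $g_j(O, \Lambda)$ (for example, a per-class log-likelihood) and uses a log-sum-exp formulation for numerical stability; the function names and default parameter values are hypothetical choices.

```python
import numpy as np

def misclassification_measure(g, c, eta=1.0):
    """d_c(O) of eq. (35), given g[j] = g_j(O, Lambda) for all C classes."""
    g = np.asarray(g, dtype=float)
    others = np.delete(g, c)           # discriminants of the competing classes
    m = others.max()
    # (1/eta) * log( (1/(C-1)) * sum_j exp(eta * g_j) ), computed stably
    soft_max = m + np.log(np.mean(np.exp(eta * (others - m)))) / eta
    return -g[c] + soft_max

def sigmoid_loss(d, gamma=2.0, theta=0.0):
    """Smoothed zero-one loss of eq. (37)."""
    return 1.0 / (1.0 + np.exp(-gamma * d + theta))
```

For instance, with `g = [-10.0, -12.0, -15.0]` the measure is negative for `c = 0` (loss below 1/2, correct decision) and positive for `c = 1` (loss above 1/2). Increasing `eta` moves the second term of (35) toward the largest competing discriminant.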

For an unknown sequence $O$, the classifier performance is measured by:

$$l(O; \Lambda) = \sum_{c=1}^{C} l_c(O; \Lambda)\, I(O \in C_c) \tag{38}$$

where $I(\cdot)$ is the indicator function. The true classifier performance can be measured by the
expected cost of the error:

$$L(\Lambda) = E[l(O; \Lambda)]. \tag{39}$$

Given a set of training observation sequences $O_r$, $r = 1, 2, \ldots, R$, an empirical loss
function on the training dataset is defined as

$$L(\Lambda) = \sum_{r=1}^{R} \sum_{c=1}^{C} l_c(O_r; \Lambda)\, I(O_r \in C_c). \tag{40}$$

The empirical loss $L$ is a smooth function in $\Lambda$ which can be optimized over the training set
using gradient descent methods such as the generalized probabilistic descent algorithm (GPD) or its
variants [11]. The HMM parameters are therefore derived by minimizing $L(\Lambda)$ using a gradient
descent algorithm.

In order to ensure that the estimated HMM parameters satisfy the stochastic constraints of
$a_{ij} > 0$, $\sum_{j=1}^{N} a_{ij} = 1$ and $b_{jk} > 0$, $\sum_{k=1}^{M} b_{jk} = 1$, we map
these parameters using

$$a_{ij} \rightarrow \tilde{a}_{ij} = \log a_{ij}, \tag{41}$$

$$b_{jk} \rightarrow \tilde{b}_{jk} = \log b_{jk}. \tag{42}$$

Then, the parameters are updated with respect to $\Lambda$. After updating, we map them back using

$$a_{ij} = \frac{\exp \tilde{a}_{ij}}{\sum_{j'=1}^{N} \exp \tilde{a}_{ij'}}, \tag{43}$$

$$b_{jk} = \frac{\exp \tilde{b}_{jk}}{\sum_{k'=1}^{M} \exp \tilde{b}_{jk'}}. \tag{44}$$
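
The effect of this change of variables can be checked numerically. The sketch below is a minimal illustration, assuming strictly positive row-stochastic matrices: it maps a matrix to the unconstrained log domain and back through a row-wise normalized exponential, so that any additive update in the log domain still yields valid probabilities. The function names are hypothetical.

```python
import numpy as np

def to_unconstrained(P):
    """Log mapping applied entrywise (entries must be strictly positive)."""
    return np.log(P)

def to_stochastic(P_tilde):
    """Row-wise normalized exponential; every row of the result sums to one."""
    # Subtracting the row maximum leaves the ratio unchanged and avoids overflow.
    E = np.exp(P_tilde - P_tilde.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)
```

Because the mapping back normalizes each row, a gradient step of arbitrary size on the transformed parameters cannot violate the stochastic constraints.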

Using a batch estimation mode, the HMM parameters are iteratively updated using

$$\Lambda(\tau + 1) = \Lambda(\tau) - \epsilon \nabla_{\Lambda} L. \tag{45}$$


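It can be shown [31] that the update equations for $\tilde{a}_{ij}^{(c)}$ and
$\tilde{b}_{jk}^{(c)}$ are

$$\tilde{a}_{ij}^{(c)}(\tau + 1) = \tilde{a}_{ij}^{(c)}(\tau) - \epsilon \left.\frac{\partial L(\Lambda)}{\partial \tilde{a}_{ij}^{(c)}}\right|_{\Lambda = \Lambda(\tau)}, \tag{46}$$

$$\tilde{b}_{jk}^{(c)}(\tau + 1) = \tilde{b}_{jk}^{(c)}(\tau) - \epsilon \left.\frac{\partial L(\Lambda)}{\partial \tilde{b}_{jk}^{(c)}}\right|_{\Lambda = \Lambda(\tau)}, \tag{47}$$

where

$$\frac{\partial L(\Lambda)}{\partial \tilde{a}_{ij}^{(c)}} = \sum_{r=1}^{R} \sum_{m=1}^{C} \frac{\partial l_m(O_r)}{\partial d_m(O_r)}\, \frac{\partial d_m(O_r)}{\partial g_c(O_r)}\, \frac{\partial g_c(O_r)}{\partial a_{ij}^{(c)}}\, \frac{\partial a_{ij}^{(c)}}{\partial \tilde{a}_{ij}^{(c)}},$$

$$\frac{\partial L(\Lambda)}{\partial \tilde{b}_{jk}^{(c)}} = \sum_{r=1}^{R} \sum_{m=1}^{C} \frac{\partial l_m(O_r)}{\partial d_m(O_r)}\, \frac{\partial d_m(O_r)}{\partial g_c(O_r)}\, \frac{\partial g_c(O_r)}{\partial b_{jk}^{(c)}}\, \frac{\partial b_{jk}^{(c)}}{\partial \tilde{b}_{jk}^{(c)}}.$$

Substituting the partial derivatives, we get the gradient direction of the update equations for the
parameters of the HMM:

$$\frac{\partial L(\Lambda)}{\partial \tilde{a}_{ij}^{(c)}} = \sum_{r=1}^{R} \sum_{m=1}^{C} \sum_{t=1}^{T} \gamma\, l_m(O_r, \Lambda)\,(1 - l_m(O_r, \Lambda)) \times \delta(q_t^r = i,\ q_{t+1}^r = j)\,(1 - a_{ij}^{(c)})\, \frac{\partial d_m(O_r)}{\partial g_c(O_r, \Lambda)},$$

$$\frac{\partial L(\Lambda)}{\partial \tilde{b}_{jk}^{(c)}} = \sum_{r=1}^{R} \sum_{m=1}^{C} \sum_{t=1}^{T} \gamma\, l_m(O_r, \Lambda)\,(1 - l_m(O_r, \Lambda)) \times \delta(q_t^r = j,\ Q_v(o_t^r) = k)\,(1 - b_{jk}^{(c)})\, \frac{\partial d_m(O_r)}{\partial g_c(O_r, \Lambda)},$$

where

$$\frac{\partial d_m(O)}{\partial g_c(O, \Lambda)} = \begin{cases} -1 & \text{if } c = m \\[4pt] \dfrac{\exp[\eta\, g_c(O, \Lambda)]}{\sum_{j,\, j \neq m} \exp[\eta\, g_j(O, \Lambda)]} & \text{if } c \neq m \end{cases} \tag{48}$$

and $\delta$ is the 2-D Kronecker delta function. The MCE-based discriminative training is outlined
in Algorithm 2.

Algorithm 2 MCE/GPD training algorithm

Require: Training data $[O^{(1)}, \cdots, O^{(R)}]$, $O^{(r)} = [o_1, \cdots, o_T]$. Fix the HMM structure variables N
and M for each model $\lambda_c$.
Ensure:
1: For each $\lambda_c$, cluster training data into N clusters, and let $s_i$, the center of each
   cluster, be the representative of state i.
2: For each $\lambda_c$, cluster the training data into M symbols. The center of each cluster $v_m$
   is a symbol of the codebook.
3: while stopping criteria not satisfied do
4:    Compute the loss function of each sequence O using (40);
5:    Update A of each $\lambda_c$ using (46);
6:    Update B of each $\lambda_c$ using (47);
7: end while

A more detailed derivation of the update equations for the MCE-based discriminative training
algorithm can be found in [30].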

2.1.8.3 Variational Bayesian training

The Bayesian framework provides a solution to the over-fitting problem. Unfortunately,
computations in the Bayesian framework can be intractable. Most existing methods, such as Markov
Chain Monte Carlo (MCMC) [27] and the Laplace approximation [25], either require vast computational
resources to get good estimates or crudely approximate all the posteriors via a normal
distribution. The variational Bayesian (VB) method attempts to approximate the integration as
accurately as possible while remaining computationally tractable [29].

Let $O$, $Z$, $\Theta$, and $\lambda$ denote the training data set, latent (hidden) variables, a set
of model parameters, and the model identity itself, respectively. In VB, the aim is to maximize the
marginal likelihood defined as:

$$\Pr(O \mid \lambda) = \int \Pr(O, Z, \Theta \mid \lambda)\, dZ\, d\Theta. \tag{49}$$

The VB approach introduces a variational posterior distribution $p(Z, \Theta \mid \lambda)$ of the
model parameters and hidden variables. Taking the logarithm, then the expectation with respect to the
distribution $p(Z, \Theta \mid \lambda)$, and applying Jensen's inequality [32], we obtain:

$$\log \Pr(O \mid \lambda) = F(p) + KL(p \,\|\, \Pr) \tag{50}$$

where

$$F(p) = \int p(Z, \Theta \mid \lambda) \log \frac{\Pr(O, Z, \Theta \mid \lambda)}{p(Z, \Theta \mid \lambda)}\, dZ\, d\Theta \tag{51}$$

and

$$KL(p \,\|\, \Pr) = \int p(Z, \Theta \mid \lambda) \log \frac{p(Z, \Theta \mid \lambda)}{\Pr(Z, \Theta \mid O, \lambda)}\, dZ\, d\Theta. \tag{52}$$

The term $F(p)$ represents a lower bound of $\log \Pr(O \mid \lambda)$. The aim of VB learning is to
maximize this lower bound by tuning the variational posterior $p(Z, \Theta \mid \lambda)$: as the
variational posterior approaches the true posterior, the bound becomes tight and the marginal
likelihood can be efficiently approximated. The equality between $F(p)$ and
$\log \Pr(O \mid \lambda)$ occurs if and only if
$p(Z, \Theta \mid \lambda) = \Pr(Z, \Theta \mid \lambda, O)$. In practice, we use a factorized
approximation $p(Z, \Theta) \approx p_Z(Z)\, p_\Theta(\Theta)$. Thus, $F$ can be written in the form:
Thus,  P  can  be  written  in  the  form: 


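$$F(p) = \int p_Z(Z)\, p_\Theta(\Theta) \log \frac{\Pr(O, Z, \Theta \mid \lambda)}{p_Z(Z)\, p_\Theta(\Theta)}\, dZ\, d\Theta = F(p_Z(Z), p_\Theta(\Theta), O, \lambda) \tag{53}$$

The quantity $F$ can be viewed as a functional of the free distributions $p_Z(Z)$ and
$p_\Theta(\Theta)$. The variational Bayesian (VB) algorithm iteratively maximizes $F$ in equation
(53) with respect to the free distributions $p_Z(Z)$ and $p_\Theta(\Theta)$. Taking the partial
derivatives of $F$ with respect to the free distributions, and setting them to zero, results in the
following update equations:

$$p_Z^{(t+1)}(Z) \propto \exp\left[\int \log \Pr(Z, O \mid \Theta, \lambda)\, p_\Theta^{(t)}(\Theta)\, d\Theta\right] \tag{54}$$

and

$$p_\Theta^{(t+1)}(\Theta) \propto \Pr(\Theta \mid \lambda) \exp\left[\int \log \Pr(Z, O \mid \Theta, \lambda)\, p_Z^{(t+1)}(Z)\, dZ\right] \tag{55}$$

where superscript $(t)$ denotes the iteration number. Clearly, $p_Z(Z)$ and $p_\Theta(\Theta)$ are
coupled, so we iterate these equations until convergence.

The variational Bayesian training steps are outlined in Algorithm 3.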


Algorithm 3 VB training algorithm

Require: Training data $[O^{(1)}, \cdots, O^{(R)}]$, $O^{(r)} = [o_1, \cdots, o_T]$. Fix the variables N and M.
Ensure:
1: Cluster training data into N clusters, and let $s_i$, the center of each cluster, be the
   representative of state i
2: Cluster training data into M clusters, and let $v_m$, the center of each cluster, be the mth
   element of the codebook
3: Set priors on the HMM parameters.
4: while stopping criteria not satisfied do
5:    Update marginal posteriors on the hidden variables Z using equation (54);
6:    Update marginal posteriors on the parameters $\Theta$ using equation (55);
7: end while
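
The decomposition of the log marginal likelihood into a lower bound plus a KL divergence can be verified numerically on a toy model. The sketch below is an assumption-laden illustration rather than an HMM implementation: it collapses the hidden variables and parameters into a single discrete latent variable with a handful of values, so that the integrals become finite sums. `joint[z]` stands for the joint probability of the data and latent value `z`, and `q` is an arbitrary variational distribution; both names are hypothetical.

```python
import numpy as np

def lower_bound_and_kl(joint, q):
    """Return (F, KL, log evidence) for a discrete latent variable."""
    joint = np.asarray(joint, dtype=float)
    q = np.asarray(q, dtype=float)
    evidence = joint.sum()                 # marginal likelihood of the data
    posterior = joint / evidence           # true posterior over the latent variable
    F = np.sum(q * (np.log(joint) - np.log(q)))       # lower bound
    KL = np.sum(q * (np.log(q) - np.log(posterior)))  # divergence from true posterior
    return F, KL, np.log(evidence)
```

For any valid `q`, the first two returned values add up to the third, and the KL term vanishes only when `q` equals the true posterior, which is the case in which the bound is tight.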


A more detailed derivation of the update equations for the VB training algorithm can be
found in [29].

2.1.9 Initialization of the HMM parameters


A key question in HMM training is how to choose initial estimates of the HMM parameters to improve
the likelihood of reaching the global optimum and the rate of convergence. A number of initialization
methods have been proposed [7]. Examples include:

•  Random:  the  HMM  parameters  are  generated  randomly  from  a  uniform  distribution. 

•  Manual  segmentation:  when  the  hidden  states  have  a  physical  meaning,  manual  partitioning 
could  be  performed  to  split  the  data  into  the  different  states  of  the  HMM  and  then  the 
remaining  parameters  could  be  derived  [33]. 

• Segmental k-means: starting from a random guess of the HMM parameters, and using the
Viterbi algorithm to label the observation sequences, the segmental k-means clusters the
sequences to learn the HMM parameters [7].
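
As an illustration of the first option in the list above, the following sketch draws random initial parameters for a discrete HMM with N states and M symbols and normalizes them so that the stochastic constraints hold. The function name `random_hmm_init` and the small positive lower bound on the uniform draws are assumptions made for this example.

```python
import numpy as np

def random_hmm_init(N, M, seed=None):
    """Random initial (pi, A, B) for a discrete HMM with N states and M symbols."""
    rng = np.random.default_rng(seed)

    def normalize(X):
        # Divide by the sum over the last axis so each distribution sums to one.
        return X / X.sum(axis=-1, keepdims=True)

    # Draw from a uniform distribution bounded away from zero so that no
    # transition or emission starts with zero probability.
    pi = normalize(rng.uniform(1e-3, 1.0, size=N))
    A = normalize(rng.uniform(1e-3, 1.0, size=(N, N)))
    B = normalize(rng.uniform(1e-3, 1.0, size=(N, M)))
    return pi, A, B
```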

2.2  MultiStream  HMMs 

For  complex  classification,  multiple  sources  of  information  may  contribute  to  the  generation 
of  sequences.  In  the  standard  HMMs,  the  different  features  contribute  equally  to  the  classification 
decision.  However,  these  features  may  have  different  relevance  degrees  that  depend  on  different 
regions  of  the  feature  space.  Thus,  more  complex  HMM  based  structures  were  proposed  [34,  30]  to 
handle  temporal  data  with  multiple  sources  of  information.  Approaches  toward  the  combination  of 
different  modalities  can  be  divided  into  three  main  categories: 

• Feature level fusion: multiple features are concatenated into a large feature vector and a single
HMM model is trained [35]. This type of fusion has the drawback of treating heterogeneous
features as equally important. It also cannot easily represent the loose timing synchronicity
between different modalities.

• Decision level fusion: the modalities are processed separately to build independent models
[36]. This approach completely ignores the correlation between features and allows complete
asynchrony between the streams.

• Model level fusion: an HMM model that is more complex than a standard one is sought. This
additional complexity is needed to handle the correlation between modalities, and the loose
synchronicity between sequences.

Several HMM structures have been proposed in the context of model level fusion. Examples include
factorial HMM [37], coupled HMM [38] and Multi-stream HMM [34]. Both factorial and coupled
HMM structures allow asynchrony between sequences since a separate state sequence is assigned to
each stream [39]. However, this is performed at the expense of an approximate parameter estimation.
In fact, the parameters of factorial and coupled HMMs could be estimated via the EM (Baum-Welch)
algorithm [26]. However, the E-step is computationally intractable and approximation approaches
are used instead [38, 37].

Multi-Stream HMM (MSHMM) [34, 30] is an HMM based structure that handles multiple
modalities for temporal data. In [30], weights are assigned to the different streams and incorporated
into the model parameters. The added complexity of model parameters makes the inference task
more challenging and in some cases, even the ML training becomes intractable. In [30], multiple
MSHMM structures for the discrete and continuous HMM have been proposed. Generalized training
frameworks combining ML and MCE training were derived to learn the MSHMM parameters
simultaneously.

MSHMM addresses the issue of fusing multiple features. Another problem in complex real-world
applications is that a single model is not sufficient to accurately handle the intra-class
variability. To alleviate this limitation, multiple simple models could be used and fused. The models
need not be accurate on all the regions of the feature space. The multiple model approach can
be regarded as ensemble learning. Generally, an ensemble learning method includes two main steps
that could potentially be intertwined: the creation and the fusion of the multiple models. Different
approaches of ensemble learning are described in the following section.

2.3  Ensemble  learning  methods 

In  this  section,  we  outline  some  ensemble  learning  approaches  that  have  been  extensively 
studied  and  used  in  many  applications. 

2.3.1  Bagging 

Bagging is considered as a method of independent construction of the base classifiers [14]. It
consists of running a learning algorithm several times with random samples from the training data
for each run. Given a set of N training data points, at each iteration, bagging chooses a set of data
points of size m < N via uniform sampling with replacement from the original data points. At each
run, some data points will appear multiple times, while other data points will not appear at all in
the resampled dataset. If the base classifier is unstable, i.e., a small variation in the training
data leads to a large change in the resulting classifier, bagging will result in a diverse set of
hypotheses. On the other hand, if the base classifier is stable, bagging may underperform the base
classifier.
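
The resampling procedure described above can be sketched as follows. This is an illustrative sketch only: labels are assumed to be in {-1, +1}, the ensemble decision is taken by an unweighted majority vote, and `fit_centroid` is a hypothetical stand-in base learner (a nearest-centroid classifier) used to keep the example self-contained.

```python
import numpy as np

def bagging_fit(fit_base, X, y, sample_size, n_rounds=10, seed=None):
    """Train n_rounds base classifiers, each on a bootstrap sample of size sample_size."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_rounds):
        # Uniform sampling with replacement from the original data points.
        idx = rng.integers(0, n, size=sample_size)
        models.append(fit_base(X[idx], y[idx]))
    return models

def bagged_predict(models, X):
    """Unweighted majority vote of classifiers that output labels in {-1, +1}."""
    votes = np.sum([model(X) for model in models], axis=0)
    return np.where(votes >= 0, 1, -1)

def fit_centroid(X, y):
    """Stand-in base learner: assign each point to the nearest class centroid."""
    mu_pos = X[y == 1].mean(axis=0)
    mu_neg = X[y == -1].mean(axis=0)

    def predict(Z):
        d_pos = np.linalg.norm(Z - mu_pos, axis=1)
        d_neg = np.linalg.norm(Z - mu_neg, axis=1)
        return np.where(d_pos <= d_neg, 1, -1)

    return predict
```

A nearest-centroid learner is fairly stable, so in this sketch the bagged vote is expected to behave much like a single classifier; the benefit of bagging would show with an unstable base learner.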

2.3.2  Boosting  (AdaBoost) 

Following the taxonomy in [40], Boosting is a method of coordinated construction of ensembles.
While in bagging the samples are drawn uniformly with replacement, in boosting methods the
learning algorithm is called at each iteration using a different distribution over the training
examples. At iteration $t + 1$, this technique places higher weights on the examples misclassified by
the base classifier at iteration $t$. Let $\{(x_i, y_i),\ y_i = \pm 1\}_{i=1}^{N}$ be a set of $N$
labeled training examples. The goal is to construct a weighted sum of hypotheses such that
$H(x_i) = \sum_t w_t h_t(x_i)$ has the same sign as $y_i$. The algorithm operates as follows. Let
$d_t(x_i)$ be the weight of data point $x_i$ at iteration $t$ of the algorithm. Initially, all
training data points are given a uniform weight ($d_1(x_i) = 1/N$). At iteration $t$, the underlying
learning algorithm constructs hypothesis $h_t$ to minimize the weighted training error. The resulting
weighted error on the training set is $r_t = \sum_i d_t(x_i)\, y_i\, h_t(x_i)$. The weight assigned
to hypothesis $h_t$ is

$$w_t = \frac{1}{2} \ln\left(\frac{1 + r_t}{1 - r_t}\right). \tag{56}$$

For each data point $i$, the weight for the next iteration is updated using

$$d_{t+1}(x_i) = d_t(x_i)\, \frac{\exp(-w_t\, y_i\, h_t(x_i))}{Z_t}, \tag{57}$$

where $Z_t$ is a normalizing factor.
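
The steps above can be sketched in code as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the base learner is a hypothetical decision stump chosen by exhaustive search, the quantity `r` is computed as the weighted sum of $y_i h_t(x_i)$ so that it plugs directly into (56), and `r` is clipped slightly inside (-1, 1) to keep the logarithm finite.

```python
import numpy as np

def fit_stump(X, y, d):
    """Return (feature, threshold, sign) maximizing r = sum_i d_i * y_i * h(x_i)."""
    best, best_r = None, -np.inf
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            pred = np.where(X[:, f] >= thr, 1, -1)
            r = np.sum(d * y * pred)
            for sign in (1, -1):           # also consider the flipped stump
                if sign * r > best_r:
                    best_r, best = sign * r, (f, thr, sign)
    return best

def stump_predict(stump, X):
    f, thr, sign = stump
    return sign * np.where(X[:, f] >= thr, 1, -1)

def adaboost(X, y, n_rounds=10):
    """Return a list of (w_t, stump_t) pairs."""
    n = len(y)
    d = np.full(n, 1.0 / n)                # uniform initial weights d_1(x_i) = 1/N
    ensemble = []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, d)
        h = stump_predict(stump, X)
        r = np.clip(np.sum(d * y * h), -1 + 1e-10, 1 - 1e-10)
        w = 0.5 * np.log((1 + r) / (1 - r))          # eq. (56)
        ensemble.append((w, stump))
        d = d * np.exp(-w * y * h)                   # eq. (57), numerator
        d /= d.sum()                                 # division by Z_t
    return ensemble

def adaboost_predict(ensemble, X):
    H = sum(w * stump_predict(s, X) for w, s in ensemble)
    return np.where(H >= 0, 1, -1)
```

On a one-dimensional problem where the positive class occupies an interval, no single stump separates the classes, but the weighted sum of several stumps can.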

It has been shown in [41] that this algorithm is a form of gradient optimization in the
function space with $J(H) = \sum_i \exp(-y_i H(x_i))$ as the objective function. The quantity
$y_i H(x_i)$ could be interpreted as the margin by which $x_i$ is correctly classified. Minimizing
$J$ is equivalent to maximizing the margin. Work by Vapnik [42] on SVMs (Support Vector Machines)
derives upper bounds on the generalization error with regard to the margins of the training data.
Freund et al. [43] extended the margin analysis to AdaBoost. The generalization error is bounded by
the fraction of training data for which the margin is less than a quantity $\Theta > 0$, plus a term
that grows as


d  Injj) 
N  0 


(58) 


where $d$ is a measure of the expressive power of the hypothesis space from which the individual
classifiers are drawn, known as the VC-dimension. The value of $\Theta$ can be optimized such that
the bound in (58) is minimized. Experimentally, AdaBoost has been shown to increase the margins
on the training data points, which partly explains its good generalization performance on new data
points.

Although AdaBoost shows significant improvement over existing classification methods on
benchmark datasets, it suffers from the following drawbacks:

• AdaBoost does not perform well in high-noise settings, as it tends to put very high weights on
the noisy data points.

• The upper bounds on the generalization error are not tight and cannot fully explain the
success of AdaBoost across multiple benchmark datasets. Dietterich [44] noticed that it is
possible to design algorithms that are more effective than AdaBoost at increasing the margin
on the training data, but these algorithms have very poor generalization error compared to
AdaBoost. Also, it is worth noting that when using large decision trees and neural networks
as base classifiers, AdaBoost works very well even though these base classifiers are known to
have high VC-dimensions.

2.3.3  Hierarchical  mixture  of  experts 

The HME approach, introduced by Jordan et al. in [45], is a tree structured architecture
which is based on the principle of divide-and-conquer. The model uses a probabilistic approach to
partition the input space into overlapping regions on which the "experts" act. The leaves of the
tree are denoted "expert networks", and the non-terminal elements of the tree are called "gating
functions". Each gating function is associated with a node (expert network) in the tree. It weights
the outputs of the subtrees coming into it, to form a combined output. The final network output is
obtained at the root of the tree. An illustration of an HME with 2 levels and a common branching
factor of 2 at each level is shown in Figure 1. In the general architecture, multiple levels and
branching factors are possible. However, the model structure should be fixed before
proceeding to training.

Expert network $(i, j)$ produces its output $\mu_{ij}$ as a generalized linear function of the input
$x$:

$$\mu_{ij} = f(U_{ij}\, x), \tag{59}$$

where $U_{ij}$ is a weight matrix and $f$ is a fixed continuous nonlinearity. For regression
problems, $f$ is generally chosen to be the identity function. For binary classification, $f$ is
generally taken to be the logistic function. Other choices for $f$ could be taken to handle the
application at hand (e.g., multiway classification, counting, rate estimation).

Figure 1. Block diagram of a 2-level HME

The gating networks are also generalized linear functions of the input $x$. Let
$\xi_i = v_i^T x$, where $v_i$ is a weight vector. Then the ith output of the top level gating
network is the "softmax" distribution of the $\xi_i$:

$$g_i = \frac{e^{\xi_i}}{\sum_k e^{\xi_k}}. \tag{60}$$

It is worth noting that the $g_i$'s are positive and sum to one for each $x$. They can be viewed as a
"soft" partitioning of the input space. Similarly, the gating networks at lower levels are
generalized linear systems. Thus, we define intermediate variables $\xi_{ij} = v_{ij}^T x$, and let

$$g_{j|i} = \frac{e^{\xi_{ij}}}{\sum_k e^{\xi_{ik}}} \tag{61}$$

be the output of the jth unit in the ith gating network at the second level of the architecture. The
output vector at each nonterminal node of the tree is the weighted sum of the experts below that
node:

$$\mu_i = \sum_j g_{j|i}\, \mu_{ij}, \tag{62}$$

and the output at the top level of the tree is:

$$\mu = \sum_i g_i\, \mu_i. \tag{63}$$

The hierarchy can be given a probabilistic interpretation. First, we refer to $g_i$ and $g_{j|i}$ as
the prior probabilities, as they depend only on $x$. Then, for each expert, we assume that the true
output $y$ is chosen from a distribution $P$ with mean $\mu_{ij}$. Therefore, the total probability
of generating $y$ from $x$ is obtained by summing over all the possible paths to the root of the
tree, that is:

$$P(y \mid x, \theta) = \sum_i g_i \sum_j g_{j|i}\, P(y \mid x, \theta_{ij}). \tag{64}$$

To develop the EM-based learning algorithm for HME, the posterior probabilities associated with
each node in the tree are derived according to Bayes' rule:

$$h_i = \frac{g_i \sum_j g_{j|i}\, P_{ij}(y)}{\sum_i g_i \sum_j g_{j|i}\, P_{ij}(y)}, \tag{65}$$

$$h_{j|i} = \frac{g_{j|i}\, P_{ij}(y)}{\sum_j g_{j|i}\, P_{ij}(y)}. \tag{66}$$

Finally, the joint posterior probability $h_{ij}$ is the product of $h_i$ and $h_{j|i}$:

$$h_{ij} = \frac{g_i\, g_{j|i}\, P_{ij}(y)}{\sum_i g_i \sum_j g_{j|i}\, P_{ij}(y)}. \tag{67}$$

This quantity is the probability that expert network $(i, j)$ has generated the data, based on the
knowledge of both the input and the output. It should be emphasized that the above quantities are
conditional on the input $x$.
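
The forward computation and the posterior probabilities of a two-level HME can be sketched as follows. The block is illustrative only and rests on several assumptions: the link function $f$ is the identity (regression case), each expert density $P_{ij}(y)$ is an isotropic Gaussian centered at $\mu_{ij}$ with a fixed standard deviation `sigma`, and the array shapes and function names are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def hme_forward(x, V_top, V_low, U):
    """Two-level HME forward pass.
    x: (D,), V_top: (I, D), V_low: (I, J, D), U: (I, J, P, D)."""
    g = softmax(V_top @ x)                    # top-level gate g_i
    g_cond = softmax(V_low @ x, axis=1)       # lower-level gate g_{j|i}, shape (I, J)
    mu_ij = U @ x                             # expert outputs with f = identity, (I, J, P)
    mu_i = np.einsum('ij,ijp->ip', g_cond, mu_ij)   # blend experts under each gate
    mu = g @ mu_i                             # output at the root, shape (P,)
    return g, g_cond, mu_ij, mu

def hme_posteriors(y, g, g_cond, mu_ij, sigma=1.0):
    """Posterior probabilities h_i, h_{j|i}, h_ij for one (x, y) pair."""
    p_dim = y.shape[0]
    sq = np.sum((y - mu_ij) ** 2, axis=-1)
    # Assumed expert density: isotropic Gaussian with mean mu_ij.
    P = np.exp(-0.5 * sq / sigma**2) / (2 * np.pi * sigma**2) ** (p_dim / 2)
    joint = g[:, None] * g_cond * P           # g_i * g_{j|i} * P_ij(y)
    h_ij = joint / joint.sum()
    h_i = joint.sum(axis=1) / joint.sum()
    h_cond = joint / joint.sum(axis=1, keepdims=True)   # g_i cancels in this ratio
    return h_i, h_cond, h_ij
```

In an EM-based training loop, these posteriors would play the role of the E-step responsibilities, with the M-step refitting the gating and expert parameters against them.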


The problem of learning in HME can be cast as a maximum likelihood estimation problem.
Given a data set $X = \{(x^{(k)}, y^{(k)})\}_{k=1}^{N}$, the log-likelihood is computed by taking the
log of the product of $N$ densities of the form of equation (64):

$$l(\theta; X) = \sum_k \log \sum_i g_i^{(k)} \sum_j g_{j|i}^{(k)}\, P_{ij}(y^{(k)}). \tag{68}$$

An analytical solution that maximizes the likelihood in (68) could not be derived, as it involves
three summations and a logarithm term. In practice, iterative methods such as gradient descent [45]
and expectation-maximization [46] algorithms were proposed in the HME literature.

2.3.4 Local fusion


One  way  to  categorize  classifier  combination  methods  is  based  on  the  way  they  select  or  assign 
weights  to  the  individual  classifiers.  Some  methods  are  global  and  assign  a  degree  of  worthiness, 
that  is  averaged  over  the  entire  training  data,  to  each  classifier.  Other  methods  are  local  and  adapt 
the  classifiers’  performance  to  different  data  subspaces. 

In  [47],  a  local  fusion  approach,  called  Context-Dependent  Fusion  (CDF),  has  been  proposed. 
CDF  has  two  main  steps.  First,  it  uses  a  standard  clustering  algorithm  to  partition  the  training 
signatures  into  groups  of  signatures,  or  contexts.  Then,  the  fusion  parameters  are  adapted  to  each 
context.  CDF  treats  the  partitioning  of  the  feature  space  and  the  selection  of  local  expert  classifiers 
as  two  independent  processes  performed  sequentially. 

In [48], a generic framework for context-dependent fusion, called Context Extraction for
Local Fusion (CELF), was proposed. CELF jointly optimizes the partitioning of the feature space and
the fusion of the classifiers. The partitioning of the feature space includes a feature discrimination
component to identify clusters in subspaces of a possibly high-dimensional feature space. Within these
subspaces (also called contexts), a linear aggregation of the different algorithms' outputs provides a
better classification rate than all the individual classifiers and the global approach.

Both  CDF  and  CELF  use  multiple  algorithms  and  features.  Any  algorithm/features  pair 
can  be  used  as  long  as  the  algorithms  have  a  continuous,  probability-like  output  and  the  features 
are  metric.  However,  these  approaches  are  not  applicable  in  the  context  of  sequential  data  as  there 
is  no  robust  similarity  measure  between  two  sequences.  Also,  the  aforementioned  approaches  do  not 
intervene  at  the  model  creation  and  training  level. 

2.4  Multiple  models  fusion 

For  complex  detection  and  classification  problems  involving  data  with  large  intra-class  varia¬ 
tions  and  noisy  inputs,  good  solutions  are  difficult  to  achieve,  and  no  single  source  of  information  can 
provide  a  satisfactory  solution.  As  a  result,  combination  of  multiple  classifiers  (or  multiple  experts) 
is  playing  an  increasing  role  in  solving  these  complex  pattern  recognition  problems,  and  has  proven 
to  be  a  viable  alternative  to  using  a  single  classifier. 

There are many taxonomies for classifier combination methods: trainable vs. non-trainable,
class label combination vs. continuous output combination, classifier selection vs. classifier fusion,
etc.

In the latter taxonomy, classifier selection methods put an emphasis on the development of
the classifier structure. First, these methods identify the single best classifier or a selected group of
classifiers and then only their outputs are taken as a final decision or for further processing. This
approach assumes that the classifiers are complementary, and that their expertise varies according
to the different areas of the feature space. For a given test sample, these methods attempt to
predict which classifiers are more likely to be correct. Some of these methods consider the output
of only a single classifier to make the final decision [49]. Others combine the output of multiple
"local expert" classifiers [50]. Classifier fusion methods operate mainly on the classifiers' outputs,
and strive to combine the classifiers' outputs effectively. This approach assumes that the classifiers
are competitive and equally experienced over the entire feature space. For a given test sample, the
individual classifiers are applied in parallel, and their outputs are combined in some manner to reach
a group decision.

Another  way  to  categorize  classifier  combination  methods  is  based  on  the  way  they  select 
or  assign  weights  to  the  individual  classifiers.  Some  methods  are  global  and  assign  a  degree  of 
worthiness,  that  is  averaged  over  the  entire  training  data,  to  each  classifier. 

2.5  Chapter  summary 

In this chapter, we presented the background related to the proposed work. We first laid
down the fundamentals of Hidden Markov modeling and its use as a sequential data classifier. We
presented the different approaches to address complex classification problems involving noisy data
with large intra-class variations using fusion methods. We then argued for the need to use multiple
models and to efficiently combine their outputs. We also presented several ensemble learning methods
that are related to our approach. In the last section of this chapter, we detailed several
state-of-the-art methods and approaches for combining multiple models.

CHAPTER 3


ENSEMBLE  HMM  CLASSIFIER 


3.1  Introduction 

In this chapter, we start by motivating the need for our eHMM approach. Then, we outline
the architecture of the proposed method and its components. The intermediate steps of the eHMM
are illustrated using examples from a landmine detection application using Ground Penetrating Radar
(GPR) signatures and Edge Histogram Descriptors (EHD) features. The details of the application of
eHMM to landmine detection using GPR signatures and the results will be described in detail in
chapter 5. The EHD features, along with other feature representations for the GPR
signatures, are outlined in section 5.3 of the same chapter.

In the following sections, eHMM refers to both the discrete (eDHMM) and continuous
(eCHMM) cases, except where otherwise stated (e.g. individual model initialization, clusters'
models construction).

3.2  Motivations 

For a two-class problem, the baseline HMM models each class by a single model learned from all the observations within that class. The goal is to generalize from all the training data in order to classify unseen observations. However, for complex classification problems, combining observations with different characteristics to learn one model might lead to too much averaging and thus to losing the discriminating characteristics of the observations.

To illustrate this problem, we use an example of detecting buried landmines using GPR sensors¹. In this case, the training data consists of a set of N GPR alarms labeled as mines (class 1) or clutter (class 0). The goal is to generalize from the training data in order to classify unlabeled GPR signatures. In figure 2, we show three groups of mines with different strengths. It is obvious that grouping all of these signatures to learn a single model would lead to poor generalization. Similarly, the false alarms could be caused by different clutter objects and varied environmental conditions and could have significant variations. The above problem is more acute when data is collected by multiple sensors and/or using various features.

¹ The details of the landmine detection application using GPR signatures will be presented in chapter 5


(a)  Mines  with  strong  signature.  (b)  Mines  with  average  signature. 


(c)  Mines  with  weak  signature. 

Figure  2.  Mine  signatures  manually  categorized  into  three  groups. 

Consequently, learning a set of models that reflect different characteristics of the observations might be more beneficial than using one global model for each class. In this dissertation, we develop a new approach that replaces the two-model classifier with a multiple-model classifier. For instance, each group of signatures in figure 2 would be used to learn a different model. Our approach aims to capture the characteristics of the observations that would be lost under averaging in the two-model case.

We hypothesize that under realistic conditions, the data are generated by multiple models. The proposed approach, called ensemble HMM (eHMM), attempts to partition the training data in the log-likelihood space and identify multiple models in an unsupervised manner. Depending on the clusters' homogeneity and size, a different training scheme is applied to learn the corresponding HMM parameters. In particular, for homogeneous clusters with mainly observations from the same class, the HMM is trained using the standard Baum-Welch algorithm [9]. For clusters containing observations from the two classes, the corresponding HMM is trained using a discriminative training scheme based on the minimization of the misclassification error (MCE) criterion [11]. For clusters with a small number of observations, the corresponding HMM parameters are learned using the variational Bayesian approach [29]. The resulting K HMMs are then aggregated through a decision level fusion component to form a descriptive model for the data.

3.3 Ensemble HMM architecture

Let $\mathcal{O} = \{(O_r, y_r)\}_{r=1}^{R}$ be a set of R labeled sequences of length T, where $O_r = \{O_1^{(r)}, \dots, O_T^{(r)}\}$ and $y_r \in \{1, \dots, C\}$ is the label (class) of the sequence $O_r$. The proposed alternative to the baseline HMM consists of constructing a mixture of K HMMs, $\Lambda = \{\lambda_k\}_{k=1}^{K}$, to cover the diversity of the training data. The eHMM has four main components:

1.  Similarity  matrix  computation 

2.  Pairwise-distance-based  clustering 

3.  Models  initialization  and  training 

4.  Decision  level  fusion 


These steps are illustrated in the block diagram in figure 3 and are described in the next sections.


Figure 3. Block diagram of the proposed eHMM (training)

3.3.1 Similarity matrix computation

3.3.1.1 Fitting individual models to sequences

In the first step, each sequence in the training data, $O_r$, $1 \le r \le R$, is used to learn an HMM model $\lambda_r$. Even though the use of only one sequence of observations to form an HMM might lead to over-fitting, this step is only an intermediate step that aims to capture the characteristics of each sequence. The formed HMM model is meant to give a maximal description of each sequence and, therefore, over-fitting is not an issue in this context. In fact, it is desired that the model perfectly fits the observation sequence. In this case, it is expected that the likelihood of each sequence with respect to its corresponding model is higher than those with respect to the remaining models.

Let $\{\lambda_r^{(0)}\}_{r=1}^{R}$ be the set of initial models and $s_n^{(r)}$, $1 \le n \le N$, be the representative of each state in $\lambda_r^{(0)}$. First, the model states are initialized following one of the methods described in section 2.1.9. In most cases, domain knowledge will be used to initialize the model. This will be illustrated through the example of landmine detection in section 3.3.1.2.

For the discrete case, the codewords $\{v_1, \dots, v_M\}$ of the initial individual DHMM model are the actual observations of the sequence $\{O_1, \dots, O_T\}$. Consequently, the emission probability of each codeword in each state is inversely proportional to its distance to the mean of that state, i.e.,

$$b_n(m) = \frac{1 / \|v_m - s_n\|}{\sum_{l=1}^{N} 1 / \|v_m - s_l\|}, \quad 1 \le m \le M. \qquad (69)$$

To satisfy the requirement $\sum_{m=1}^{M} b_n(m) = 1$, we normalize the values by:

$$b_n(m) \leftarrow \frac{b_n(m)}{\sum_{l=1}^{M} b_n(l)}, \quad 1 \le m \le M. \qquad (70)$$
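The two-step initialization of equations (69) and (70) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `init_discrete_emissions`, the small constant `eps` (added to avoid a division by zero when a codeword coincides with a state mean), and the use of the Euclidean norm are choices made here, not details taken from the text.

```python
import math

def init_discrete_emissions(codewords, state_means, eps=1e-12):
    """Return b where b[n][m] is the initial emission probability of codeword m in state n."""
    n_states, n_codes = len(state_means), len(codewords)
    b = [[0.0] * n_codes for _ in range(n_states)]
    for m, v in enumerate(codewords):
        # Inverse distance from codeword v to each state mean; eps (an assumption)
        # guards against division by zero when a codeword coincides with a mean.
        inv = [1.0 / (math.dist(v, s) + eps) for s in state_means]
        total = sum(inv)
        for n in range(n_states):
            b[n][m] = inv[n] / total            # equation (69)
    for n in range(n_states):
        row_sum = sum(b[n])
        b[n] = [x / row_sum for x in b[n]]      # equation (70): each row sums to 1
    return b
```

A codeword that lies close to the mean of a state receives a large share of that state's emission mass, which is the behavior described above.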

In  the  continuous  case,  the  emission  probability  density  functions  are  modeled  by  mixtures 
of  Gaussians.  In  the  case  of  individual  sequence  models,  we  use  a  single  component  mixture  for  each 
state,  as  the  number  of  observations  is  small.  Thus,  the  observations  belonging  to  each  state  are 
used  to  estimate  the  mean  and  covariance  of  that  state’s  component. 

Then, the Baum-Welch algorithm [9] is used to fit one model to each given observation sequence. Let $\{\lambda_r\}_{r=1}^{R}$ be the set of trained individual models.

3.3.1.2 Illustrative example: individual model construction step

The different steps of the eHMM will be illustrated using a landmine detection application with GPR data and EHD features. We assume that each signature $O_r$, such as those shown in figure 2, is represented by a sequence of T = 15 observations $O_r = \{O_1^{(r)}, O_2^{(r)}, \dots, O_T^{(r)}\}$. Each observation is a five-dimensional vector that encodes the strengths of the edges in the horizontal, vertical, diagonal, anti-diagonal, and non-edge directions², i.e. $O_t^{(r)} = [H_t^{(r)}, V_t^{(r)}, D_t^{(r)}, A_t^{(r)}, N_t^{(r)}]$. The signatures of the mines shown in figure 2 are characterized by a hyperbolic shape comprised of a succession of rising, horizontal, and falling edges, with different strengths for each edge and variable duration in each state. Thus, for each sequence, we build a three-state HMM model that reflects our knowledge of the mine signature. Specifically, the states are labeled state $s_1$ or diagonal "Dg" state, state $s_2$ or horizontal "Hz" state, and state $s_3$ or anti-diagonal "Ad" state.

First, the priors are initialized such that mine models have to start with the diagonal state and non-mine models can start in any of the states with equal probabilities. Thus, the priors are initialized as follows:

$$\pi = \begin{cases} [1,\ 0,\ 0]^T & \text{if sequence is mine} \\ [0.33,\ 0.33,\ 0.33]^T & \text{if sequence is non-mine} \end{cases} \qquad (71)$$

Next, the transition matrix, A, is initialized using prior knowledge (expected latency in each state) and using the left-to-right constraint. We use:

$$A^{(0)} = \begin{bmatrix} 0.8 & 0.2 & 0 \\ 0 & 0.8 & 0.2 \\ 0 & 0 & 1 \end{bmatrix} \qquad (72)$$


To enforce the left-to-right constraint and the prior initialization assumption for the individual mine models, we discard observations with a weak diagonal (anti-diagonal) component at the start (end) of the sequence. In other words, we use an adaptive sequence length by defining the new start and end of the sequence, $T_{start}$ and $T_{end}$, using:

$$\begin{cases} T_{start} = \min\{t \mid O_t^{(r)}(D) > \tau,\ t = 1, \dots, T\} \\ T_{end} = \max\{t \mid O_t^{(r)}(A) > \tau,\ t = 1, \dots, T\}. \end{cases} \qquad (73)$$

In (73), $O_t^{(r)}(D)$ is the diagonal edge component of observation vector $O_t^{(r)}$, $O_t^{(r)}(A)$ is the anti-diagonal edge component of observation vector $O_t^{(r)}$, and $\tau$ is a predefined threshold.
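The initialization of equations (71) to (73) can be summarized with the following minimal Python sketch. Several details are assumptions made for illustration: the names `init_priors`, `A0`, and `trim_sequence`, the 0-based indexing, the component order [H, V, D, A, N] within each observation, the use of exactly 1/3 for the uniform prior (so that it sums to one), and the choice to return `None` when no observation exceeds the threshold, a case the text does not discuss.

```python
def init_priors(is_mine):
    # Equation (71): mine models must start in the diagonal state; non-mine
    # models start in any state with equal probability (1/3 used here).
    return [1.0, 0.0, 0.0] if is_mine else [1.0 / 3.0] * 3

# Equation (72): left-to-right transition matrix with self-loops.
A0 = [[0.8, 0.2, 0.0],
      [0.0, 0.8, 0.2],
      [0.0, 0.0, 1.0]]

def trim_sequence(obs, tau, diag=2, anti=3):
    """Equation (73): keep observations between the first strong diagonal
    edge and the last strong anti-diagonal edge.

    obs is a list of [H, V, D, A, N] vectors. Returns a tuple
    (t_start, t_end, trimmed) with 0-based inclusive indices, or None if
    no valid window exists (an assumption of this sketch).
    """
    starts = [t for t, o in enumerate(obs) if o[diag] > tau]
    ends = [t for t, o in enumerate(obs) if o[anti] > tau]
    if not starts or not ends or starts[0] > ends[-1]:
        return None
    t_start, t_end = starts[0], ends[-1]
    return t_start, t_end, obs[t_start:t_end + 1]
```

Trimming before training keeps the mine models consistent with the prior of (71), since the first retained observation is guaranteed to have a strong diagonal component.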

² The details of the edge features and their extraction will be described in section 5.3

For the emission probabilities initialization, the first step consists of computing the mean of each state. To that end, we investigate two possible initialization schemes:

• Initialization 1: This initialization scheme applies to both the mine and the non-mine sequence models. We do not attempt to recognize the state of each observation. Instead, we assume that the first few observations belong to state $s_1^{(r)}$, the middle few observations belong to state $s_2^{(r)}$, and the last few observations belong to state $s_3^{(r)}$. We simply compute the means of the three states as follows: the mean of the diagonal state "Dg", $s_1^{(r)}$, is the average of observations $(O_{T_{start}}, \dots, O_7)$; the mean of the horizontal state "Hz", $s_2^{(r)}$, is the average of observations $(O_6, \dots, O_{10})$; and the mean of the anti-diagonal "Ad" state, $s_3^{(r)}$, is the average of observations $(O_9, \dots, O_{T_{end}})$.

• Initialization 2: This initialization scheme differs between mine and non-mine models. For the mine model, we partition its observations into three clusters and label them "Dg", "Ad", and "Hz". The "Dg" cluster corresponds to the partition where the observations have stronger diagonal edges. Similarly, the "Ad" cluster corresponds to the partition where the observations have stronger anti-diagonal edges. The "Hz" cluster corresponds to the partition where the observations have weak diagonal and anti-diagonal edge components. Once each observation is assigned to one of the three clusters, we compute the means, $s_i^{(r)}$, as the average of all observations assigned to cluster i, $1 \le i \le 3$. For the non-mine model, the means of the states are obtained by clustering all of the non-mine sequence observations into three clusters using the k-means algorithm [24].

In the second step, we compute the initial emission probabilities of each state. For the discrete case, the codewords for the individual DHMM model are the actual observations of the sequence after squeezing, $\{O_{T_{start}}, \dots, O_{T_{end}}\}$. Consequently, the emission probability of each codeword in each state is computed using equations (69) and (70). In the continuous case, the mean and covariance of the emission probability density function for each state's component are estimated using the observations belonging to that state.

3.3.1.3 Computing the similarity matrix

For each observation sequence $O_r$, $1 \le r \le R$, we compute its likelihood in each model $\lambda_p$, $\Pr(O_r \mid \lambda_p)$, for $1 \le p \le R$, using either the forward or the backward procedure described in section 2.1.6. Since the models share the same architecture, it could be asserted that similar alarms would have comparable likelihood values. However, the likelihood-based similarity may not always be accurate. In fact, some observations can have a high likelihood in the model of a visually different observation. This occurs when most of the elements of a sequence partially match only one or two of the states of the model. In this case, the observation sequence can have a high likelihood in the model but its optimal Viterbi path will deviate from the typical path. To alleviate this problem, we introduce a penalty term, P, to the log-likelihood measure that is related to the mismatch between the most likely sequence of hidden states of the test sequence ($O_i$) and that of the generating sequence ($O_j$).

Let S be the R × R similarity matrix defined by:

$$S(i,j) = \alpha L(i,j) - (1 - \alpha) P(i,j), \qquad (74)$$

where

$$L(i,j) = \log \Pr(O_i \mid \lambda_j) \qquad (75)$$

is the log-likelihood of observation sequence $O_i$ in model $\lambda_j$, and

$$P(i,j) = \sum_{t=1}^{T} \left\| s^{(j)}_{q_t^{(ij)}} - s^{(j)}_{q_t^{(jj)}} \right\|, \quad 1 \le i,j \le R, \qquad (76)$$

is a penalty term that measures the deviation between the Viterbi paths of testing $O_i$ and $O_j$ with model $\lambda_j$ (recall that model $\lambda_j$ was learned using observation sequence $O_j$). In (76), $q^{(ij)}$ and $q^{(jj)}$ are respectively the most likely (as identified by the Viterbi algorithm [8]) hidden state sequences that generated sequences $O_i$ and $O_j$ from the model $\lambda_j$, and $s_n^{(j)}$ is the mean of state n in $\lambda_j$.
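As a rough illustration of the penalty in (76), the sketch below sums, over time, the distance between the means of the states visited by two Viterbi paths decoded with the same model. The function name `path_mismatch`, the Euclidean norm, and the requirement that both paths have equal length are assumptions of this sketch; the text does not say how paths of different lengths would be compared.

```python
import math

def path_mismatch(path_ij, path_jj, state_means_j):
    """Sum over t of the distance between the means of the states visited at
    time t by two Viterbi paths that were both decoded with model j.

    path_ij: state indices for sequence O_i decoded with model j.
    path_jj: state indices for sequence O_j decoded with model j.
    state_means_j: list of state mean vectors of model j.
    """
    if len(path_ij) != len(path_jj):
        # Assumption: this sketch only handles paths of equal length.
        raise ValueError("paths must have the same length")
    return sum(
        math.dist(state_means_j[a], state_means_j[b])
        for a, b in zip(path_ij, path_jj)
    )
```

Two sequences that follow the same state sequence in model j contribute a penalty of zero, so the penalty only grows where the decoded paths disagree.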

The first term in equation (74) is the log-likelihood of sequence $O_i$ being generated from model $\lambda_j$. When the likelihood term is high, it is likely that model $\lambda_j$ generated the sequence $O_i$. In particular, since by construction the parameters of model $\lambda_j$ were adjusted such that they maximize the likelihood of sequence $O_j$ in $\lambda_j$, it is expected that the likelihood of each sequence in its corresponding model be high. In this case, sequences $O_i$ and $O_j$ are expected to share the same "dynamics". When the likelihood term is low, it is unlikely that model $\lambda_j$ generated the sequence $O_i$. In this case, it is expected that $O_i$ and $O_j$ are not similar.

(a) Alarm $Sig_1$ (b) Alarm $Sig_2$ (c) Alarm $Sig_3$

Figure 4. GPR signatures of mine alarms

The second term in equation (74) is a penalty term denoted as the Viterbi path mismatch. It is derived from the Viterbi optimal paths (refer to section 2.1.7). Precisely, an entry $P(i,j)$ is the distance between the optimal path of the sequence $O_i$ in model $\lambda_j$ and the optimal path of the sequence $O_j$ in $\lambda_j$ using an appropriate distance measure between paths. For instance, we can use the distance between the means of the states appearing in each path, as defined in (76). Another distance, commonly used in "string" comparisons, is the "edit distance" [51]. The "edit distance" between two strings, say p and q, is defined as the minimum number of single-character edit operations (deletions, insertions, and/or replacements) that would convert p into q. For example, the distance between p = "1111223333" and q = "111223333" is 1, since deleting a single "1" from p yields q. The Viterbi path mismatch term is intended to ensure that similar sequences have few mismatches in their corresponding Viterbi paths. This term is important in applications where the HMM states have a "visual" interpretation or are related to a certain shape characteristic in the original sequence raw data or extracted features. The extra computational cost for obtaining the Viterbi path penalty is very low compared to that of computing the traditional likelihood. Indeed, the Viterbi path is already available when using the forward-backward procedure for the likelihood computation.
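For readers who want to experiment with the "edit distance" between Viterbi paths, a standard dynamic-programming (Levenshtein) implementation is sketched below. The function name `edit_distance` is chosen here for illustration; the paths are assumed to be encoded as strings of state labels, as in the example strings quoted in the text.

```python
def edit_distance(p, q):
    """Minimum number of single-character deletions, insertions, and
    replacements needed to convert string p into string q."""
    # prev[j] holds the distance between the first i-1 characters of p
    # and the first j characters of q.
    prev = list(range(len(q) + 1))
    for i, cp in enumerate(p, start=1):
        curr = [i]
        for j, cq in enumerate(q, start=1):
            cost = 0 if cp == cq else 1
            curr.append(min(prev[j] + 1,          # delete cp
                            curr[j - 1] + 1,      # insert cq
                            prev[j - 1] + cost))  # replace (or match)
        prev = curr
    return prev[-1]

# Two Viterbi paths written as strings of state labels.
print(edit_distance("1111223333", "111223333"))
```

Only two rows of the dynamic-programming table are kept in memory, which is sufficient for the short paths (T = 15) used in the landmine example.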

The mixing factor $\alpha \in [0, 1]$ of equation (74) is a trade-off parameter between a likelihood-based similarity and a Viterbi-path-mismatch-based similarity. It could be either set according to the application at hand or optimized using the available training data and an adequate criterion.

The matrix computed using (74) is a similarity matrix and is not symmetric. Thus, we use the following symmetrization scheme to transform it to a pairwise distance matrix:

$$\begin{cases} 1.\ D(i,j) = -S(i,j), & 1 \le i,j \le R \\ 2.\ D(i,i) = 0, & 1 \le i \le R \\ 3.\ D(i,j) = \max(D(i,j), D(j,i)), & 1 \le i,j \le R. \end{cases} \qquad (77)$$

The matrix D represents the distance or dissimilarity between pairs of sequences $O_i$ and $O_j$.
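The combination in (74) and the three symmetrization steps of (77) can be sketched as follows. The function name `similarity_to_distance` and the value of the mixing factor used in the example are illustrative assumptions; the log-likelihoods and penalties in the example are the values reported in this section for the first two mine alarms, with the self-penalties assumed to be zero.

```python
def similarity_to_distance(L, P, alpha):
    """Combine log-likelihoods L and path penalties P into the similarity
    S of equation (74), then apply the three steps of (77)."""
    R = len(L)
    S = [[alpha * L[i][j] - (1.0 - alpha) * P[i][j] for j in range(R)]
         for i in range(R)]
    D = [[-S[i][j] for j in range(R)] for i in range(R)]   # step 1
    for i in range(R):
        D[i][i] = 0.0                                      # step 2
    # Step 3: keep the larger of the two directed distances.
    return [[max(D[i][j], D[j][i]) for j in range(R)] for i in range(R)]

# Log-likelihoods and edit-distance penalties for alarms 1 and 2.
L = [[-1.65, -5.00], [-5.50, -1.55]]
P = [[0.0, 3.0], [4.0, 0.0]]   # zero self-penalty is an assumption
D = similarity_to_distance(L, P, alpha=0.5)  # alpha = 0.5 is arbitrary
```

Taking the maximum in step 3 is conservative: two sequences are considered close only if each one fits the other's model well.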

3.3.1.4 Illustrative example: similarity matrix computation step

Continuing with the landmine illustrative example, figures 4(a) and 4(b) show two typical mine alarms, $Sig_1$ and $Sig_2$, that have similar GPR signatures and similar EHD features. Figure 4(c) shows a mine alarm, $Sig_3$, that has a visually different signature compared to $Sig_1$ and $Sig_2$. Let $\lambda_i$ be the trained DHMM model for alarm $Sig_i$ and $O_i$ be its observation vector, for i = 1, ..., 3. The initial and trained DHMM models for these 3 alarms are given in figures 5, 6, and 7 respectively.

First, to check the validity of the log-likelihood measure, we compute the probabilities of $O_1$ and $O_2$ in their respective models $\lambda_1$ and $\lambda_2$. As can be seen in table 1, the probability of each sequence in its own model is higher than its probability in the other sequence's model. Note that since $Sig_1$ and $Sig_2$ are similar, the likelihoods $L_{12}$ and $L_{21}$ are relatively high. In fact, both models $\lambda_1$ and $\lambda_2$ have comparable attributes (codes, states' means, and transition matrices) as shown in figures 5 and 6. On the other hand, since $Sig_3$ is different from $Sig_1$ and $Sig_2$, the likelihoods $L_{31}$ and $L_{32}$ are relatively lower ($L_{31} = -13.23$ and $L_{32} = -11.45$).


Figure 5. DHMM model $\lambda_1$ construction for $Sig_1$. (a) EHD features of the sequence of 15 observations $O_1$. $T_{start}$ and $T_{end}$ denote the dynamic range of the sequence and are computed using equation (73). The Viterbi path of testing $O_1$ with $\lambda_1$ is displayed under each observation number. (b) Initial and learned (using Baum-Welch algorithm) parameters of model $\lambda_1$.


Table 2 shows the Viterbi paths resulting from the test of each alarm $Sig_1$, $Sig_2$, and $Sig_3$ in each model $\lambda_1$, $\lambda_2$, and $\lambda_3$. We notice that the Viterbi path mismatch penalty is not as determinant as the log-likelihood in this particular illustration test case. Nevertheless, it is worth noting that

Figure 6. DHMM model $\lambda_2$ construction for $Sig_2$. (a) EHD features of the sequence of 15 observations $O_2$. $T_{start}$ and $T_{end}$ denote the dynamic range of the sequence and are computed using equation (73). The Viterbi path of testing $O_2$ with $\lambda_2$ is displayed under each observation number. (b) Initial and learned (using Baum-Welch algorithm) parameters of model $\lambda_2$.

TABLE 1

Log-likelihoods of the three signatures in figure 4 in the models generated by these signatures

$L_{ij}$    $\lambda_1$    $\lambda_2$    $\lambda_3$
$O_1$       -1.65          -5.00          -10.13
$O_2$       -5.50          -1.55          -7.55
$O_3$       -13.23         -11.45         -1.53

the path $p_{11}$ for $Sig_1$ in its model shows a perfect symmetry, in terms of diagonal and anti-diagonal state occurrences. However, the path $p_{33}$ for $Sig_3$ in its model reflects the visual asymmetry of $Sig_3$ (stronger diagonal component).

In figure 8, we show a non-mine (i.e. clutter) alarm, $Sig_4$. Let $O_4$ denote its observation vector and let $\lambda_4$ be its DHMM model. The construction of this model is illustrated in figure 9. As mentioned in section 3.3.1.3, the motivation for using the path mismatch penalty comes from the testing of non-mine signatures, such as $Sig_4$, in a mine model, e.g. $\lambda_1$. In fact, in the computation

Figure 7. DHMM model $\lambda_3$ construction for $Sig_3$. (a) EHD features of the sequence of 15 observations $O_3$. $T_{start}$ and $T_{end}$ denote the dynamic range of the sequence and are computed using equation (73). The Viterbi path of testing $O_3$ with $\lambda_3$ is displayed under each observation number. (b) Initial and learned (using Baum-Welch algorithm) parameters of model $\lambda_3$.


TABLE 2

Viterbi paths resulting from testing each alarm in figure 4 by the three models learned from these alarms. $p_{ij}$ refers to the optimal path obtained by testing signature i with model j.

of the log-likelihood $L_{41}$, most or all of the non-mine alarm observations are assigned to one single code from $\lambda_1$ (typically, the one with weak edges), that is, the closest code to all the observations of $Sig_4$. Consequently, most of the observations in $O_4$ are assigned to the state whose mean is closest to that code. Hence, the probability of $O_4$ in $\lambda_1$ is high ($L_{41} = -1.72$) and its Viterbi path $p_{41}$, as shown in table 3, is a recurrence of state $s_1$. In this case, the path mismatch penalties, derived from the paths shown in table 3, can be easily computed using the "edit distance" ($P_{14} = 9$ and $P_{41} = 6$). Those penalty values are relatively high when compared to the previous penalty values obtained by testing mine alarms in mine models ($P_{12} = 3$, $P_{21} = 4$, $P_{13} = 3$, and $P_{31} = 2$).


Figure 8. GPR signature for a non-mine alarm, $Sig_4$

Figure 9. DHMM model $\lambda_4$ construction for $Sig_4$. (a) EHD features of the sequence of 15 observations $O_4$. (b) Initial and learned (using Baum-Welch algorithm) parameters of model $\lambda_4$.

TABLE 3

Viterbi paths resulting from testing $Sig_1$ and $Sig_4$ in models $\lambda_1$ and $\lambda_4$. $p_{ij}$ refers to the optimal path obtained by testing signature i with model j.

Figures 10(a) and 10(b) show the log-likelihood and path mismatch penalty matrices for a large collection of mine and clutter signatures. In these figures, the indices are rearranged so that the first entries correspond to the mine signatures and the latter ones correspond to non-mine signatures. In other words, the diagonal blocks correspond to testing mine signatures in mine models and non-mine signatures in non-mine models, and the off-diagonal blocks correspond to testing mine signatures in non-mine models and non-mine signatures in mine models. In these figures, dark pixels correspond to small values of the log-likelihood and path mismatch penalty and bright pixels correspond to larger entries of the corresponding matrices. Note that in the case of the log-likelihood matrix of figure 10(a), the diagonal blocks are brighter than the off-diagonal blocks. Conversely, in the path mismatch penalty matrix of figure 10(b), the diagonal blocks are darker than the off-diagonal blocks. Thus, the log-likelihood and the path mismatch penalty provide complementary measures for the (dis)similarity between signatures. This was the motivation for combining both matrices using the mixing factor $\alpha$ in (74).

3.3.2  Pairwise-distance-based  clustering 

The similarity matrix, computed using equation (74) and illustrated in figure 10 for the case of landmine detection, reflects the degree to which pairwise signatures are considered similar. The largest variation is expected to be between classes. Other significant variation may exist within the same class, e.g. the groups of signatures shown in figure 2. Our goal is to identify the similar groups so that one model can be learned for each group. This task can be achieved using any relational clustering algorithm. In our work, we investigate three different clustering algorithms: hierarchical [24], spectral with learnable cluster dependent kernels [52], and fuzzy clustering with multiple kernels [53].

Figure 10. Log-likelihood and path mismatch penalty matrices for a large collection of mine and clutter signatures

3.3.2.1 Hierarchical Clustering

In this case, the natural choice is to use an agglomerative hierarchical algorithm [24]. Agglomerative hierarchical clustering is a bottom-up approach that starts with each data point as a cluster. Then, it proceeds by merging the most similar clusters to produce a sequence of clusters. Several similarity measures between clusters have been used [24]. Examples include single link, complete link, average link, and Ward distance. In the single link method, the distance between two clusters is the minimum of the distances between all pairs of sequences drawn from the two clusters. In the complete link based algorithm, the distance between two clusters is the maximum of all pairwise distances between sequences in the two clusters. The complete link method tends to produce compact clusters, while single link is known to result in "elongated" clusters. Another interesting similarity measure between two clusters is the minimum-variance distance, or Ward distance [54]. This distance is defined as


$$d(i,j) = \frac{n_i n_j}{n_i + n_j} \left\| c_i - c_j \right\|^2 \qquad (78)$$

where $n_k$ and $c_k$ are respectively the cardinality and the centroid of the cluster $C_k$. It has been shown in [55] that this approach merges the two clusters that lead to the smallest increase in the overall variance.

3.3.2.2 Spectral Clustering with Learnable Cluster Dependent Kernels (FLeCK)

FLeCK [52] is a kernel clustering algorithm that works on both object-based representations and relational data. It learns multiple kernels by simultaneously optimizing intra-cluster and inter-cluster distances. Specifically, it tries to simultaneously minimize:

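$$J^{intra} = \sum_{i=1}^{K} \sum_{j=1}^{N} \sum_{k=1}^{N} \beta^i_{jk} \left( 1 - \exp\left( -\frac{r_{jk}}{\sigma_i} \right) \right) - \sum_{i=1}^{K} \frac{\rho}{\sigma_i} \qquad (79)$$

and maximize

$$J^{inter} = \sum_{i=1}^{K} \sum_{j=1}^{N} \sum_{k=1}^{N} \alpha^i_{jk} \left( 1 - \exp\left( -\frac{r_{jk}}{\sigma_i} \right) \right) + \sum_{i=1}^{K} \frac{\rho}{\sigma_i} \qquad (80)$$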

In (79) and (80), r is a distance matrix representing the degree to which objects from a dataset $\{x_1, \dots, x_N\}$ are dissimilar; $\sigma_i$ is the scaling parameter of cluster i; K is the number of clusters; $\rho$ is a regularization constant; $\beta^i_{jk}$ is the likelihood that two points $x_j$ and $x_k$ belong to the same cluster; and $\alpha^i_{jk}$ is the likelihood that two points $x_j$ and $x_k$ belong to different clusters.
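Precisely,

$$\beta^i_{jk} = u_{ij}^m u_{ik}^m, \qquad \alpha^i_{jk} = u_{ij}^m \left( 1 - u_{ik}^m \right) + u_{ik}^m \left( 1 - u_{ij}^m \right), \qquad (81)$$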


where $m \in (1, \infty)$ is a constant that determines the level of fuzziness, and $u_{ij}$ is the fuzzy membership of $x_j$ in cluster i and satisfies

$$0 \le u_{ij} \le 1 \quad \text{and} \quad \sum_{i=1}^{K} u_{ij} = 1, \quad \text{for } j = 1, \dots, N. \qquad (82)$$

The first term in (79) seeks clusters that have compact local relational distances

$$D^i_{jk} = 1 - \exp\left( -\frac{r_{jk}}{\sigma_i} \right) \qquad (83)$$

with respect to each cluster i.

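Assuming that the scaling parameters $\sigma_i$ are independent from each other, it can be shown [52] that optimization of (79) and (80) can be performed iteratively using

$$\sigma_i^{(p)} = \frac{\sum_{j=1}^{N} \sum_{k=1}^{N} \beta^i_{jk}\, r_{jk}^2 \exp\left( -r_{jk} / \sigma_i^{(p-1)} \right)}{\sum_{j=1}^{N} \sum_{k=1}^{N} \alpha^i_{jk}\, r_{jk} \exp\left( -r_{jk} / \sigma_i^{(p-1)} \right) + \sum_{j=1}^{N} \sum_{k=1}^{N} \beta^i_{jk}\, r_{jk} \exp\left( -r_{jk} / \sigma_i^{(p-1)} \right)} \qquad (84)$$

and

$$u_{ij} = \frac{1}{\sum_{t=1}^{K} \left( d_{ij}^2 / d_{tj}^2 \right)^{1/(m-1)}} \qquad (85)$$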


where $d_{ik}$ is the Euclidean distance from feature $x_k$ to cluster i, whose square can be written [56] in terms of the relational matrix $D^i$ as

$$d_{ik}^2 = \left( D^i v_i \right)_k - \frac{v_i^T D^i v_i}{2}. \qquad (86)$$

In (86), $v_i$ is the membership vector of all N samples in cluster i defined by

$$v_i = \frac{\left( u_{i1}^m, \dots, u_{iN}^m \right)^T}{\sum_{j=1}^{N} u_{ij}^m}. \qquad (87)$$

Starting with an initial partition, FLeCK iteratively updates the fuzzy membership variables $\beta^i_{jk}$ and $\alpha^i_{jk}$ and the scaling parameters $\sigma_i$. The FLeCK algorithm steps are summarized in Algorithm 4.


Algorithm 4 The FLeCK algorithm, adapted from [52]

Require:
1: Symmetric distance matrix, r
2: Number of clusters K, fuzzifier m, and scaling parameters $\sigma_i$
Ensure:
3: repeat
4:   Compute the distances $D^i$ for all clusters using (83);
5:   Compute the membership vectors $v_i$ using (87);
6:   Compute the distances $d_{ik}^2$ using (86);
7:   Update the fuzzy memberships $u_{ij}$ using (85);
8:   Update the scaling parameters $\sigma_i$ using (84);
9: until Fuzzy memberships do not change
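To make steps 4 to 7 of Algorithm 4 more concrete, the sketch below performs a single membership update with the scaling parameters held fixed. It is not the FLeCK implementation of [52]: the function name `update_memberships`, the clipping of non-positive squared distances to a small `eps`, and the omission of the scaling parameter update of step 8 are all assumptions made here for illustration.

```python
import math

def update_memberships(r, U, sigma, m=2.0, eps=1e-12):
    """One pass over steps 4-7 of Algorithm 4 with fixed scaling parameters.

    r: N x N symmetric distance matrix.
    U: K x N matrix, U[i][j] is the membership of sample j in cluster i.
    sigma: list of K scaling parameters.
    """
    K, N = len(U), len(r)
    d2 = []
    for i in range(K):
        # Step 4: cluster-dependent relational distances, equation (83).
        Di = [[1.0 - math.exp(-r[j][k] / sigma[i]) for k in range(N)]
              for j in range(N)]
        # Step 5: normalized membership vector, equation (87).
        w = [U[i][j] ** m for j in range(N)]
        total = sum(w)
        v = [x / total for x in w]
        # Step 6: squared distance of every sample to cluster i, equation (86).
        Dv = [sum(Di[k][j] * v[j] for j in range(N)) for k in range(N)]
        vDv = sum(v[k] * Dv[k] for k in range(N))
        # Clipping to eps is an assumption that keeps the ratios below defined.
        d2.append([max(Dv[k] - 0.5 * vDv, eps) for k in range(N)])
    # Step 7: fuzzy membership update, equation (85).
    expo = 1.0 / (m - 1.0)
    new_U = [[0.0] * N for _ in range(K)]
    for j in range(N):
        for i in range(K):
            new_U[i][j] = 1.0 / sum((d2[i][j] / d2[t][j]) ** expo
                                    for t in range(K))
    return new_U
```

In a full implementation this update would be repeated, together with the update of the scaling parameters, until the memberships stop changing.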


3.3.2.3 Fuzzy Clustering with multiple kernels (FCM-MK)

The  FCM-MK  [53]  consists  of  reducing  the  dimensionality  of  the  original  distance  matrix 
and  performing  fuzzy  c-means  clustering  [57]  on  the  reduced  data.  The  steps  for  FCM-MK  are 
detailed  in  Algorithm  5. 


Algorithm 5 The FCM-MK algorithm, adapted from [53]

Require: symmetric distance matrix, D
Require: "neighborhood" size
Ensure:
1: Compute the $\sigma_K$ matrix using the "neighborhood" parameter
2: Compute the kernel matrix K as $K = \exp(-D^2 / \sigma_K)$
3: Normalize K
4: Perform singular value decomposition (SVD) on the normalized kernel K
5: Derive the corresponding mapping using the corresponding feature reduction dimension
6: Perform fuzzy c-means clustering on the reduced dimension space mapping
7: Derive the corresponding partition


3.3.2.4 Illustrative example: hierarchical clustering step

Continuing with the illustrative example of landmine detection, we use the standard hierarchical clustering algorithm to cluster the distance matrix D computed from the similarity matrix S of (74), using (77). The log-likelihood and path mismatch penalty terms of S for a collection of mine and clutter signatures were shown in figure 10. We cluster D into a pre-set number of clusters, K = 20. The choice of K stems from prior knowledge of the training data at hand. In particular, the choice of K should depend on the number of mine types, the clutter categories, and burial depths observed in the data collection. In Figure 11(a), we plot the resulting distance matrix (that combines figure 10(a) and figure 10(b)) as a gray-scale intensity image. We use the VAT algorithm [58] to rearrange the matrix entries so that each diagonal block corresponds to a cluster. The diagonal blocks of the matrix are darker, which corresponds to smaller intra-cluster distances.

Figure 11. Hierarchical clustering results of the matrix D for a large collection of mine and clutter signatures. Panel (a): distance matrix.
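
Clustering a distance matrix into a pre-set number of clusters can be sketched as follows. This is a minimal illustration only: the text does not state which linkage the hierarchical clustering uses, so average linkage is an assumption, and the function name `agglomerative` and the toy distance matrix are hypothetical.

```python
def agglomerative(D, K):
    """Average-linkage agglomerative clustering of a symmetric distance matrix.

    D: N x N list of pairwise distances.
    K: number of clusters to stop at.
    Returns a list of K clusters, each a sorted list of sample indices.
    """
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > K:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the two candidate clusters.
                d = sum(D[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Toy data (hypothetical): five points on a line.
points = [0.0, 1.0, 10.0, 11.0, 50.0]
D = [[abs(a - b) for b in points] for a in points]
groups = agglomerative(D, 3)
```

The quadratic search over cluster pairs is kept for readability; it is adequate only for small collections.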

Similarly, the dendrogram in figure 11(b) shows that, at a certain threshold, we can identify two main groups of clusters. On the left-hand side of the dendrogram, clusters {3, 17, ..., 13} have mainly clutter signatures. On the right-hand side, clusters {1, 5, ..., 15} are dominated by mine signatures. These findings are further highlighted in figure 12, where some clusters (e.g. clusters 1, 5, and 15) have a large number of mines and few or no clutter alarms. The few clutter alarms in these clusters are typically strong clutter alarms with mine-like signatures. Figure 13 shows sample alarms belonging to cluster 5.

Similarly, some clusters are dominated by clutter with few mines (e.g. clusters 2, 13, and 18). The few mines included in these clusters, as shown in figure 14, are typically mines with weak signatures that are either low metal mines or deeply buried mines. Other clusters are composed of a mixture of mine and clutter signatures (e.g. clusters 3, 6, and 11). The mines within these clusters are either low metal mines (Figure 12(b), cluster 6) or deeply buried mines (Figure 12(c), clusters 3 and 11). Sample alarms belonging to cluster 11 are shown in figure 15.

The above figures illustrate the purpose of step 2 of the proposed eDHMM approach, i.e. grouping similar signatures into clusters.

Figure 12. Distribution of the alarms in each cluster: (a) per class, (b) per type, (c) per depth (distribution of the mine signatures in each cluster per burial depth).

To summarize, using the distance matrix D of (77), the R sequences are clustered into K clusters. The objective of this clustering step is to group similar observations in the log-likelihood space. The observations forming each cluster are then used to learn multiple HMM models. The initialization of each model is performed by averaging the parameters of the individual models of the sequences belonging to that cluster. Then, for each cluster, the initial model and the sequences belonging to that cluster are used to train and update the HMM model parameters. In the next step, we show how different HMM models can be learned for the different clusters.

Figure 13. Sample signatures belonging to a cluster dominated by strong mines (cluster 5). Above each signature, "M" denotes a mine signature and the number refers to the burial depth.

Figure 14. Sample signatures belonging to a cluster dominated by weak signatures (cluster 18). Above each signature, "M" denotes a mine signature, "C" denotes a clutter signature, and the number refers to the burial depth.

Figure 15. Sample signatures belonging to a cluster that has a mixture of weak mines and clutter signatures (cluster 11). Above each signature, "M" denotes a mine signature, "C" denotes a clutter signature, and the number refers to the burial depth.

3.3.3 Initialization and training of the ensemble HMM


The previous clustering step results in $K$ clusters, each comprised of $N_k$ potentially similar sequences. These $K$ clusters will be used to learn the ensemble HMMs. Let $N_k^{(c)}$ denote the number of sequences assigned to the same cluster $k$ and labeled as class $c$, such that

$$0 \le N_k^{(c)} \le N_k, \quad \text{and} \quad \sum_{c=1}^{C} N_k^{(c)} = N_k. \qquad (88)$$

For instance, for the landmine example, if we let $c = 1$ denote the class of mines and $c = 0$ denote the class of clutter, $N_k^{(1)}$ would be the number of mines assigned to cluster $k$. The goal of this step is to initialize up to $C$ HMM models $\{\lambda_k^{(c)}\}$ for each of the $K$ clusters. Let $\mathcal{O}_k^{(c)} = \{O_r^{(c)}, y_r^{(c)}\}_{r=1}^{N_k^{(c)}}$ be the set of sequences of class $c$ belonging to cluster $k$ and $\{\lambda_r^{(c)}\}_{r=1}^{N_k^{(c)}}$ be the corresponding individual sequence models, $c \in \{1, \dots, C\}$.

For each cluster, we devise the following optimized training methods based on the cluster's size and homogeneity.

• Clusters dominated by sequences from only one class: In this case, we learn only one model for this cluster. The sequences within this cluster are presumably similar and belong to the same ground truth class, denoted $C_i$. We assume that this cluster is representative of that particular dominating class. It is expected that the class conditional posterior probability is unimodal and peaked around the MLE of the parameters. Thus, maximum likelihood estimation would result in HMM model parameters that best fit this particular class. For these reasons, we use the standard Baum-Welch re-estimation procedure [9]. Let $K_1$ be the number of homogeneous clusters and $\{\lambda_i^{BW}, i = 1, \dots, K_1\}$ denote the set of BW-trained models.

• Clusters with a mixture of observations belonging to different classes: In this case, we learn one model for each class represented in this cluster. It is expected that the posterior distribution of the classes is multimodal. The MLE approach is not adequate and more advanced techniques are needed to address the multimodality, such as genetic algorithms [59] or simulated annealing optimization [60]. In our work, we build a model for each class within the cluster. The focus is on finding the class boundaries within the posteriors rather than trying to approximate a joint posterior probability. Thus, the model parameters are jointly optimized such that the overall misclassification error is minimized. In this context, we use discriminative training based on minimizing the misclassification error to learn a model for each class [11]. The details of the discriminative training based on the MCE criterion are given in section 2.1.8.2. Let $K_2$ be the number of mixed clusters and $\{\lambda_{j,c}^{MCE}, j = 1, \dots, K_2, c = 1, \dots, C\}$ be the set of MCE-trained models.

• Clusters containing a small number of sequences: In this case, we learn only one model for the dominating class, denoted $C_k$, of each cluster. The MLE and MCE approaches need a large number of data points to give good estimates of the model parameters. Thus, when a cluster has few samples, the above approaches may not be reliable. The Bayesian training framework [25], on the other hand, is suitable for clusters with a small number of sequences. These clusters may contain information about sequences with distinctive characteristics. In this case, we use a variational Bayesian approach [29] to approximate the class conditional posterior distribution. First, the model parameters are given priors. Then, parameter posteriors are computed given the available data. Finally, the required probabilities are derived by integrating over the parameter posteriors. Let $K_3$ be the number of small clusters and $\{\lambda_k^{VB}, k = 1, \dots, K_3\}$ denote the set of Bayesian-trained models.

Therefore, for each homogeneous cluster $i$, we define one model $\lambda_i^{(C_i)}$, $i = 1, \dots, K_1$, for the dominating class $C_i$. For mixed clusters, we define $C$ models per cluster: $\lambda_j^{(c)}$, $c = 1, \dots, C$, $j = 1, \dots, K_2$. For each small cluster, we define one model for the dominating class $C_k$. The ensemble HMM mixture is defined as $\{\lambda_k^{(c)}\}$, where $k \in \{1, \dots, K\}$, and $c = C_k$ if cluster $k$ is dominated by sequences labeled with class $C_k$, and $c \in \{1, \dots, C\}$ if cluster $k$ is a mixed cluster.
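
The choice among the three training schemes can be sketched as a small routing function. This is only an illustration under stated assumptions: the text does not give numeric criteria for "small" or "dominated", so the thresholds `min_size` and `purity`, and the function name `choose_training`, are hypothetical.

```python
def choose_training(counts, min_size=10, purity=0.9):
    """Pick a training scheme for one cluster from its per-class counts.

    counts: dict mapping class label -> number of sequences in the cluster.
    min_size, purity: hypothetical thresholds (not given in the text).
    Returns (scheme, classes_to_model).
    """
    total = sum(counts.values())
    dominant = max(counts, key=counts.get)
    if total < min_size:
        # Small cluster: one variational Bayesian model for the dominating class.
        return ("VB", [dominant])
    if counts[dominant] / total >= purity:
        # Homogeneous cluster: one Baum-Welch model for the dominating class.
        return ("BW", [dominant])
    # Mixed cluster: one MCE-trained model per class present in the cluster.
    return ("MCE", sorted(c for c, n in counts.items() if n > 0))


scheme_a = choose_training({"M": 40, "C": 1})
scheme_b = choose_training({"M": 20, "C": 25})
scheme_c = choose_training({"M": 3, "C": 1})
```

Checking the size before the purity means a small but pure cluster is still routed to the Bayesian scheme, which matches the motivation given for small clusters.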

We assume that all models $\lambda_k^{(c)}$ have a fixed number of states $N$. For each model $\lambda_k^{(c)}$, the initialization step consists of assigning the priors, the initial state transition probabilities, and the state parameters (initial means and initial emission probabilities) using the observations $\mathcal{O}_k^{(c)}$ and their respective individual models $\lambda_r^{(c)}$, $r \in \{1, \dots, N_k^{(c)}\}$. In particular, the initial values for the priors and the state transition probabilities are obtained by averaging, respectively, the priors and the state transition probabilities of the individual models $\lambda_r^{(c)}$, $r \in \{1, \dots, N_k^{(c)}\}$. The initialization of the emission probabilities in each state, $b_n^{(k,c)}$, depends on whether the HMM is discrete or continuous.

• Discrete HMM (DHMM): the state representatives and the codebook of model $\lambda_k^{(c)}$ are obtained by partitioning and quantizing the observations $\mathcal{O}_k^{(c)}$. Any clustering algorithm, such as k-means [61] or fuzzy c-means [62], could be used for this task. First, the class-$c$ sequences belonging to cluster $k$, $O_r^{(c)}$, are "unrolled" to form a vector of observations $U^{(k,c)}$ of length $N_k^{(c)} T$. The state representatives, $s_n^{(k,c)}$, are obtained by clustering $U^{(k,c)}$ into $N$ clusters and taking the centroid of each cluster as the state representative. Similarly, the codebook $V^{(k,c)} = [v_1^{(k,c)}, \dots, v_M^{(k,c)}]$ is obtained by clustering $U^{(k,c)}$ into $M$ clusters. For each symbol $v_m^{(k,c)}$, the membership in each state $s_n^{(k,c)}$ is computed using:

$$b_n^{(k,c)}(m) = \frac{1 / \| v_m^{(k,c)} - s_n^{(k,c)} \|}{\sum_{l=1}^{N} 1 / \| v_m^{(k,c)} - s_l^{(k,c)} \|}, \quad 1 \le m \le M. \qquad (89)$$

To satisfy the requirement $\sum_{m=1}^{M} b_n^{(k,c)}(m) = 1$, we scale the values by:

$$b_n^{(k,c)}(m) \leftarrow \frac{b_n^{(k,c)}(m)}{\sum_{l=1}^{M} b_n^{(k,c)}(l)}. \qquad (90)$$


• Continuous HMM (CHMM): we assume that each state has $N_g$ Gaussian components. For each model $\lambda_k^{(c)}$, the observation sequences $O_r^{(c)}$, $r \in \{1, \dots, N_k^{(c)}\}$, are assigned to one of the states and unrolled to form a vector of observations $U_n^{(k,c)}$ for each state $s_n^{(k,c)}$, $n = 1, \dots, N$. The mean and covariance of component $g$ of state $s_n^{(k,c)}$ are computed by clustering $U_n^{(k,c)}$ into $N_g$ clusters using the k-means algorithm [23]. Precisely, the mean of each component is the center of one of the resulting clusters and the covariance is estimated using the observations belonging to that same cluster.
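
The discrete-case emission initialization can be sketched as below. This is a hedged illustration, assuming that the membership of a code in a state is inversely proportional to its Euclidean distance from the state representative, followed by a per-state rescaling so that each state's emission probabilities sum to one. The function name `init_emissions`, the small constant `eps`, and the toy vectors are hypothetical.

```python
import math


def init_emissions(states, codebook, eps=1e-9):
    """Initialize discrete emission probabilities from distances.

    states: list of N state representatives (lists of floats).
    codebook: list of M code vectors (lists of floats).
    Returns B as an N x M list with B[n][m] = b_n(m); each row sums to 1.
    """
    N, M = len(states), len(codebook)
    B = [[0.0] * M for _ in range(N)]
    for m, v in enumerate(codebook):
        # Inverse distance of code v to every state representative.
        inv = [1.0 / (math.dist(v, s) + eps) for s in states]
        total = sum(inv)
        for n in range(N):
            B[n][m] = inv[n] / total  # membership of code m in state n
    for n in range(N):
        row_sum = sum(B[n])
        B[n] = [b / row_sum for b in B[n]]  # rescale so sum over m is 1
    return B


# Toy data (hypothetical): two states, three codes in a 2-D feature space.
states = [[0.0, 0.0], [1.0, 0.0]]
codebook = [[0.1, 0.0], [0.9, 0.0], [0.5, 0.0]]
B = init_emissions(states, codebook)
```

Codes close to a state representative receive most of that state's emission mass, while a code equidistant from all states is shared evenly before the rescaling.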


After initialization, we use one of the training schemes described earlier to update the model parameters using the respective observations $\mathcal{O}_k^{(c)}$, $k \in \{1, \dots, K\}$, $c \in \{1, \dots, C\}$. To summarize, for homogeneous clusters, BW training results in one model $\lambda_i^{BW}$ per cluster. For mixed clusters, MCE training results in $C$ models per cluster: $\lambda_{j,c}^{MCE}$, $c = 1, \dots, C$. For small clusters, variational Bayesian learning results in one model per cluster, $\lambda_k^{VB}$.


3.3.3.1 Illustrative example: cluster models initialization and training steps


Referring back to the landmine example, the clustering of the training data resulted in 20 clusters. As shown in figure 12, eight clusters (1, 4, 5, 7, 8, 15, 16, and 17) were composed of only mine alarms, two clusters (3 and 20) contain exclusively clutter alarms, and the remaining ten clusters have both mine and clutter alarms. Therefore, using the notations of Algorithm 6, we define our eHMM as

$$\begin{aligned}
\lambda_i^{BW} &= \{\lambda_1^{(M)}, \lambda_4^{(M)}, \lambda_5^{(M)}, \lambda_7^{(M)}, \lambda_8^{(M)}, \lambda_{15}^{(M)}, \lambda_{16}^{(M)}, \lambda_{17}^{(M)}, \lambda_{13}^{(C)}, \lambda_{20}^{(C)}\}, \\
\lambda_j^{MCE} &= \{\lambda_j^{(M)}, \lambda_j^{(C)}\}, \quad j \in \{2, 3, 6, 9, 10, 11, 12, 14, 16, 18, 19\}, \\
\lambda_k^{VB} &= \emptyset.
\end{aligned} \qquad (91)$$

Depending on the cluster type, one or two HMM models are built based on the elements belonging to that cluster: one for mines and/or one for clutter. The initialization of these models is performed as follows. The priors and transition matrices of models $\lambda_k^{(M)}$ and $\lambda_k^{(C)}$ are obtained by averaging the corresponding parameters of the individual models of the mine and clutter signatures belonging to cluster $k$, respectively. We assume that each model has $N = 3$ states. The initialization of the emission probabilities for the discrete case is straightforward. That is, the means of the states $s_n$, $n = 1, \dots, N$, are obtained by clustering the codes of the individual models into $N = 3$ clusters using the k-means algorithm [23]. Similarly, the codes $v_m$, $m = 1, \dots, M$, of the cluster model are obtained by clustering the codes of the individual models into $M = 20$ clusters.
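
The averaging used to initialize the priors and transition matrices of a cluster model can be sketched as follows. This is a minimal illustration; the function name `average_models` and the two toy models are hypothetical. An element-wise average of row-stochastic matrices is itself row-stochastic, so the result is a valid starting point without further normalization.

```python
def average_models(models):
    """Average the priors and transition matrices of individual sequence models.

    models: list of (prior, A) pairs, where prior is a length-N list and
            A is an N x N list of transition probabilities.
    Returns (prior, A) for the initial cluster model.
    """
    R = len(models)
    N = len(models[0][0])
    prior = [sum(p[n] for p, _ in models) / R for n in range(N)]
    A = [[sum(a[i][j] for _, a in models) / R for j in range(N)]
         for i in range(N)]
    return prior, A


# Toy data (hypothetical): two left-to-right models with N = 3 states.
model_1 = ([1.0, 0.0, 0.0],
           [[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]])
model_2 = ([0.8, 0.2, 0.0],
           [[0.8, 0.2, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]])
prior0, A0 = average_models([model_1, model_2])
```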


TABLE 4

$\lambda_5^{(M)}$ DHMM model parameters of cluster 5

Columns: Means (H, V, D, A, N) | Initial A (s1, s2, s3) | A (s1, s2, s3)

State  H     V     D     A     N      s1    s2    s3     s1    s2    s3
s1     0.28  0.15  0.31  0.10  0.16   0.67  0.33  0.00   0.59  0.41  0.00
s2     0.48  0.10  0.12  0.13  0.17   0.00  0.69  0.31   0.00  0.26  0.74
s3     0.27  0.15  0.09  0.34  0.16   0.00  0.00  1.00   0.00  0.00  1.00

Columns: Codes (H, V, D, A, N) | Initial B (s1, s2, s3) | B (s1, s2, s3)

Code   H     V     D     A     N      s1    s2    s3     s1    s2    s3
v1     0.21  0.23  0.09  0.34  0.14   0.00  0.00  0.22   0.00  0.00  0.12
v2     0.36  0.01  0.10  0.04  0.39   0.03  0.00  0.01   0.04  0.09  0.03
v3     0.26  0.00  0.06  0.00  0.67   0.01  0.00  0.01   0.05  0.07  0.03
v4     0.57  0.01  0.06  0.07  0.23   0.00  0.11  0.02   0.02  0.15  0.01
v5     0.09  0.53  0.04  0.26  0.13   0.00  0.00  0.02   0.00  0.00  0.12
v6     0.47  0.09  0.21  0.10  0.11   0.03  0.19  0.00   0.03  0.14  0.01
v7     0.36  0.17  0.33  0.11  0.07   0.18  0.03  0.00   0.15  0.00  0.00
v8     0.38  0.09  0.11  0.26  0.09   0.01  0.16  0.26   0.00  0.00  0.12
v9     0.00  0.56  0.01  0.11  0.31   0.00  0.00  0.00   0.00  0.00  0.12
v10    0.69  0.03  0.09  0.06  0.09   0.00  0.20  0.00   0.03  0.13  0.02
v11    0.27  0.17  0.27  0.13  0.16   0.17  0.01  0.00   0.14  0.01  0.00
v12    0.57  0.06  0.19  0.11  0.04   0.00  0.28  0.00   0.03  0.14  0.02
v13    0.20  0.13  0.24  0.10  0.31   0.11  0.00  0.01   0.09  0.04  0.02
v14    0.09  0.23  0.23  0.07  0.37   0.08  0.00  0.00   0.07  0.04  0.03
v15    0.14  0.14  0.11  0.10  0.50   0.03  0.00  0.08   0.06  0.06  0.04
v16    0.14  0.11  0.06  0.47  0.11   0.00  0.00  0.19   0.00  0.00  0.12
v17    0.19  0.14  0.41  0.07  0.13   0.19  0.00  0.00   0.15  0.00  0.00
v18    0.10  0.29  0.04  0.33  0.33   0.00  0.00  0.16   0.00  0.00  0.12
v19    0.14  0.26  0.30  0.05  0.23   0.12  0.00  0.00   0.10  0.03  0.02
v20    0.39  0.09  0.21  0.04  0.26   0.05  0.01  0.00   0.06  0.09  0.01

After initialization, and depending on the cluster size and homogeneity, one of the training schemes described earlier is applied. In table 4, we show the initial and the BW-trained model of cluster 5. Recall that cluster 5 contains a large number of strong mines and few clutter alarms with mine-like signatures, as shown in figure 13. Therefore, model $\lambda_5^{(M)}$ is expected to reflect the typical characteristics of strong mine signatures, i.e. a hyperbolic shape of the signature with a succession of Dg-Hz-Ad states. This is corroborated by the results reported in table 4:

1. The means of states $s_1$, $s_2$, and $s_3$ (i.e. the Dg, Hz, and Ad states) have the highest component in the D, H, and A dimension, respectively.

2. In the initial model, the matrix A is nearly symmetric with regard to the latency in states $s_1$ and $s_2$ (with approximately 2/3 probability) versus the transition to the subsequent state (with approximately 1/3 probability). After training, however, the model is less likely to stay in $s_2$ (0.26 probability) and tends to transition rapidly (0.74 probability) to $s_3$. This can be explained by the strength of the diagonal and anti-diagonal edges of the mine signatures in cluster 5 (refer to figure 13), compared to the horizontal edge.

3. Codes with high emission probability B in state $s_1$ ($b_1(v_7)$, $b_1(v_{11})$, $b_1(v_{13})$, $b_1(v_{17})$, and $b_1(v_{19})$) have a strong D component. The same remark is valid for codes with high emission probability B in states $s_2$ and $s_3$, which have strong H and A components, respectively.

TABLE 5

$\lambda_2^{(M)}$ DHMM model parameters of cluster 2

Columns: Means (H, V, D, A, N) | Initial A (s1, s2, s3) | A (s1, s2, s3)

State  H     V     D     A     N      s1    s2    s3     s1    s2    s3
s1     0.18  0.11  0.22  0.09  0.39   0.78  0.22  0.00   0.82  0.18  0.00
s2     0.14  0.17  0.08  0.10  0.51   0.00  0.00  1.00   0.00  0.00  1.00
s3     0.17  0.15  0.03  0.27  0.38   0.00  0.00  1.00   0.00  0.00  1.00

Columns: Codes (H, V, D, A, N) | Initial B (s1, s2, s3) | B (s1, s2, s3)

Code   H     V     D     A     N      s1    s2    s3     s1    s2    s3
v1     0.21  0.10  0.27  0.07  0.34   0.11  0.00  0.00   0.20  0.00  0.00
v2     0.01  0.09  0.09  0.11  0.70   0.00  0.00  0.07   0.00  0.00  0.08
v3     0.03  0.27  0.00  0.10  0.60   0.00  0.00  0.07   0.00  0.00  0.08
v4     0.06  0.44  0.04  0.14  0.31   0.11  0.00  0.00   0.00  0.00  0.08
v5     0.09  0.39  0.03  0.10  0.40   0.11  0.00  0.00   0.00  0.00  0.08
v6     0.04  0.07  0.06  0.19  0.64   0.00  0.00  0.07   0.00  0.00  0.08
v7     0.26  0.17  0.06  0.16  0.36   0.00  0.00  0.07   0.00  0.00  0.08
v8     0.09  0.31  0.02  0.28  0.30   0.00  0.00  0.13   0.00  0.00  0.08
v9     0.26  0.09  0.04  0.21  0.40   0.00  0.00  0.07   0.00  0.00  0.08
v10    0.04  0.13  0.14  0.06  0.63   0.11  0.00  0.00   0.05  0.21  0.01
v11    0.24  0.01  0.00  0.03  0.71   0.11  0.00  0.07   0.05  0.17  0.02
v12    0.21  0.09  0.40  0.01  0.29   0.11  0.00  0.00   0.20  0.00  0.00
v13    0.30  0.01  0.36  0.03  0.30   0.11  0.00  0.00   0.20  0.00  0.00
v14    0.50  0.03  0.11  0.06  0.30   0.00  0.50  0.00   0.09  0.09  0.02
v15    0.09  0.07  0.19  0.11  0.54   0.00  0.00  0.07   0.07  0.17  0.01
v16    0.07  0.31  0.10  0.09  0.43   0.00  0.50  0.00   0.05  0.19  0.02
v17    0.09  0.11  0.01  0.39  0.40   0.00  0.00  0.13   0.00  0.00  0.08
v18    0.13  0.22  0.15  0.09  0.41   0.11  0.00  0.07   0.08  0.16  0.01
v19    0.11  0.17  0.09  0.13  0.49   0.11  0.00  0.13   0.00  0.00  0.08
v20    0.09  0.24  0.00  0.27  0.40   0.00  0.00  0.07   0.00  0.00  0.08

TABLE 6

$\lambda_2^{(C)}$ DHMM model parameters of cluster 2

Columns: Means (H, V, D, A, N) | Initial A (s1, s2, s3) | A (s1, s2, s3)

State  H     V     D     A     N      s1    s2    s3     s1    s2    s3
s1     0.14  0.02  0.04  0.04  0.75   0.95  0.05  0.00   0.76  0.24  0.00
s2     0.17  0.03  0.05  0.06  0.69   0.00  0.85  0.15   0.00  0.31  0.69
s3     0.14  0.02  0.03  0.05  0.75   0.00  0.00  1.00   0.00  0.00  1.00

Columns: Codes (H, V, D, A, N) | Initial B (s1, s2, s3) | B (s1, s2, s3)

Code   H     V     D     A     N      s1    s2    s3     s1    s2    s3
v1     0.17  0.01  0.00  0.39  0.41   0.00  0.00  0.02   0.00  0.00  0.13
v2     0.61  0.00  0.00  0.00  0.36   0.00  0.06  0.00   0.03  0.13  0.04
v3     0.01  0.00  0.00  0.00  0.91   0.09  0.00  0.55   0.04  0.08  0.05
v4     0.23  0.13  0.17  0.24  0.23   0.00  0.00  0.02   0.00  0.00  0.13
v5     0.10  0.07  0.07  0.06  0.67   0.13  0.03  0.03   0.10  0.00  0.00
v6     0.31  0.03  0.03  0.16  0.47   0.01  0.00  0.02   0.00  0.00  0.13
v7     0.16  0.00  0.01  0.01  0.76   0.36  0.06  0.10   0.05  0.05  0.06
v8     0.06  0.01  0.06  0.01  0.81   0.17  0.00  0.15   0.10  0.00  0.00
v9     0.20  0.17  0.06  0.24  0.33   0.00  0.00  0.03   0.00  0.00  0.13
v10    0.29  0.00  0.01  0.01  0.66   0.14  0.35  0.02   0.03  0.16  0.04
v11    0.19  0.14  0.11  0.10  0.43   0.01  0.08  0.01   0.10  0.00  0.00
v12    0.24  0.04  0.08  0.08  0.56   0.03  0.16  0.00   0.03  0.17  0.03
v13    0.36  0.04  0.14  0.14  0.31   0.01  0.03  0.02   0.03  0.14  0.04
v14    0.11  0.24  0.34  0.01  0.20   0.01  0.00  0.00   0.10  0.00  0.00
v15    0.14  0.06  0.64  0.03  0.13   0.00  0.00  0.00   0.10  0.00  0.00
v16    0.46  0.00  0.00  0.01  0.51   0.00  0.08  0.00   0.03  0.14  0.04
v17    0.43  0.01  0.06  0.06  0.46   0.00  0.09  0.01   0.03  0.14  0.04
v18    0.04  0.18  0.19  0.02  0.49   0.01  0.00  0.00   0.10  0.00  0.00
v19    0.16  0.07  0.04  0.20  0.56   0.01  0.05  0.03   0.00  0.00  0.13
v20    0.11  0.07  0.16  0.07  0.54   0.03  0.00  0.00   0.10  0.00  0.00

In tables 5 and 6, we show the mine and the clutter model of cluster 2, respectively. As shown earlier in figure 12, cluster 2 has a large number of clutter alarms with few weak mines. Therefore, the means of states $s_1$, $s_2$, and $s_3$ of $\lambda_2^{(M)}$ have respectively weaker D, H, and A components compared to the state means of $\lambda_5^{(M)}$ reported in table 4. However, the non-edge N component of the $\lambda_2^{(M)}$ state means ($s_1([N]) = 0.39$, $s_2([N]) = 0.51$, $s_3([N]) = 0.38$) is higher than those of $\lambda_5^{(M)}$ ($s_1([N]) = s_3([N]) = 0.16$ and $s_2([N]) = 0.17$). Moreover, the non-edge component is significantly higher in the state means of the $\lambda_2^{(C)}$ model ($s_1([N]) = s_3([N]) = 0.75$ and $s_2([N]) = 0.69$), as shown in table 6.

Given the set $\{\lambda_k^{(c)}, k = 1, \dots, K, c \in \{M, C\}\}$ of learned models (refer to tables 4, 5, and 6 for a few examples of clusters' mine and clutter models), we analyze the performance of these models on the training data. In Figure 16, we plot the likelihood of a mine signature belonging to cluster 1 in all cluster models. As expected, the highest likelihood occurs when testing the sequence with the DHMM model of cluster 1. Moreover, the higher likelihood values correspond to the mine-dominated clusters (1, 4, 5, 6, ..., 14, 15, 17, ...).

Figure 16. (a) A sample alarm signature assigned to cluster 1. (b) Clusters' models responses to the signature in (a). Axis label in (b): Model of cluster #.

Figure 17. (a) A sample alarm signature assigned to cluster 5. (b) Clusters' models responses to the signature in (a). Axis label in (b): Model of cluster #.

The  same  remark  is  valid  for  the  models’  response  to  a  test  signature  of  a  strong  mine  from 
cluster  5,  as  shown  in  figure  17.  In  this  case,  fewer  models  responded  to  that  signature.  These  models 
correspond  to  clusters  containing  similar  signatures  (strong  mines).  Similarly,  figure  18  shows  that 
a  test  sequence  belonging  to  cluster  2  (which  is  dominated  by  clutter  signatures)  has  high  likelihood 
in  only  clutter-dominated  clusters’  models. 

Figure 18. (a) A sample alarm signature assigned to cluster 2. (b) Clusters' models responses to the signature in (a). Axis label in (b): Model of cluster #.

Figure 19 shows the scatter plot of the log-likelihood of the training data in model 5 (ordinate axis) versus model 1 (abscissa axis). In this figure, we display clutter, low metal (LM), and high metal (HM) signatures at different depths using different symbols and colors. Even though the two models are dominated by mine signatures, we see that not all confidence values are highly correlated. In fact, some strong mine signatures have high log-likelihoods in model 5 and lower log-likelihoods in model 1 (upper left side of the scatter plot, region R1). This can be attributed to the fact that cluster 5 contains mainly strong mines and is more likely to yield a high log-likelihood when testing a strong mine signature. On the other hand, in region R2, the cluster 1 model performs better, as it gave higher likelihood values to the "weak" mines in that region. This result is expected because cluster 1 contains weak mine signatures.

Figure 19. Scatter plot of the log-likelihoods of the training data in model 5 (strong mines) vs. model 1 (weak mines). Clutter, low metal (LM), and high metal (HM) signatures at different depths are shown with different symbols and colors.


Figure 20. Scatter plot of the log-likelihoods of the training data in model 12 (clutter) vs. model 5 (strong mines). Clutter, low metal (LM), and high metal (HM) signatures at different depths are shown with different symbols and colors. Axis label: log-likelihood in Model 5.


In Figure 20, we display the scatter plot of the log-likelihood of all the training data in model 12 (clutter) versus model 5 (mines). As clusters 5 and 12 are dominated respectively by mines and clutter, their results are negatively correlated. That is, the cluster 5 model assigns higher log-likelihoods to mine signatures and lower log-likelihoods to clutter signatures. The cluster 12 model does the opposite. In region R1 in Figure 20, strong signatures (high metal mines and/or mines buried at shallow depths) are assigned high likelihood values in the cluster 5 model and very low likelihood values in the cluster 12 model. In region R2, a group of clutter signatures and deeply buried low metal mines have high likelihoods in the cluster 12 model and very low likelihoods in the cluster 5 model.

Figures  19  and  20  illustrate  that  the  learned  HMM  models  give  different  results  on  different 
regions  of  the  input  space  and  that  they  are  potentially  complementary. 

Although each cluster may have one or two models, its output is a single confidence value. Precisely, the output of the BW-trained models is $\Pr(O|\lambda_i)$ while the output of the MCE-trained models is $\Pr(O|\lambda_j) = \Pr(O|\lambda_j^{(M)}) - \Pr(O|\lambda_j^{(C)})$. Therefore, we can consider each cluster model as a classifier and analyze its overall performance on the training data. In figure 21, we plot the ROCs of some cluster models. The cluster ROCs show that the models perform differently at different false alarm rates. We also notice that no model consistently outperforms all other models. For instance, as shown previously in the scatter plots of figures 19 and 20, the cluster 5 model assigns high confidence values to strong mines but fails to recognize weak signatures.
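
Treating one cluster's confidence as a classifier output, its ROC can be traced by sweeping a threshold over the confidences. The sketch below is a minimal illustration; the function names `cluster_confidence` and `roc_points` and the toy values are hypothetical, tied confidences are not merged (an assumption made for brevity), and both classes are assumed to be present.

```python
def cluster_confidence(p_mine, p_clutter=None):
    """Single confidence per cluster: the mine-model output for a one-model
    cluster, or the mine minus clutter output for a two-model cluster."""
    return p_mine if p_clutter is None else p_mine - p_clutter


def roc_points(confidences, labels):
    """Return (false alarm rate, detection rate) pairs, one per threshold.

    labels: 1 for mine, 0 for clutter. Higher confidence means more mine-like.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold one sample at a time, most confident first.
    for conf, y in sorted(zip(confidences, labels), key=lambda t: -t[0]):
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


# Toy data (hypothetical): two mines and two clutter alarms.
roc = roc_points([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```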


Figure 21. ROCs of some clusters' models. Solid line: clusters dominated by mines. Dashed line: clusters dominated by clutter.

3.3.4 Decision level fusion of multiple HMMs


In the previous steps of our ensemble HMM, we initialized and learned multiple models for the different clusters. Any sequence $O$ will be tested by these models. Depending on the type and the contents of the clusters, one or more models may respond strongly to the test sequence $O$. The multiple responses of the different models need to be combined into a single confidence value. We propose using the procedure outlined in Algorithm 6 to achieve this task.

Figure 22. Block diagram of the proposed eHMM (testing)

The block diagram for the proposed eHMM, in testing mode, is given in figure 22. The output of the Baum-Welch- and VB-trained cluster models is $\Pr(O|\lambda_k)$ while the output of the MCE-trained cluster models is $\max_c \Pr(O|\lambda_{k,c}^{MCE})$. Although each cluster may have up to $C$ models, its output is a single confidence value that represents the degree to which the input test sequence "belongs" to that cluster.

Algorithm 6 Testing a new sequence using the eHMM

Require: Test observation $O$
Ensure:
1: Compute $\Pr(O|\lambda_i^{BW})$ for the $K_1$ clusters learned with BW
2: Compute $\Pr(O|\lambda_j^{MCE}) = \max_c \Pr(O|\lambda_{j,c}^{MCE})$ for the $K_2$ clusters learned with MCE
3: Compute $\Pr(O|\lambda_k^{VB})$ for the $K_3$ clusters learned with VB
4: $\Pr(O) = H(\Pr(O|\lambda_i^{BW}), \Pr(O|\lambda_j^{MCE}), \Pr(O|\lambda_k^{VB}))$
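
Steps 1 to 3 of the testing procedure can be sketched for discrete HMMs as follows. This is a hedged illustration, not the dissertation's implementation: it assumes a scaled forward algorithm for the likelihood and works in the log domain, where taking the maximum over class models is equivalent to taking it over the likelihoods. The function names `log_likelihood` and `cluster_confidences`, and the toy models, are hypothetical. The fusion function H of step 4 is left out.

```python
import math


def log_likelihood(obs, prior, A, B):
    """Scaled forward algorithm: log Pr(O | lambda) for a discrete HMM.

    obs: list of symbol indices; prior[n]: initial state probabilities;
    A[i][j]: transition probabilities; B[n][m]: emission probabilities.
    """
    N = len(prior)
    alpha = [prior[n] * B[n][obs[0]] for n in range(N)]
    loglik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                     for j in range(N)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return loglik


def cluster_confidences(obs, ensemble):
    """One confidence per cluster.

    ensemble: list of clusters, each a list of (prior, A, B) models:
    a single model for a one-model cluster, several for a mixed cluster,
    in which case the best-scoring class model is kept.
    """
    return [max(log_likelihood(obs, *model) for model in cluster)
            for cluster in ensemble]


# Toy models (hypothetical): one-state and two-state discrete HMMs.
m1 = ([1.0], [[1.0]], [[0.5, 0.5]])
m2 = ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
conf = cluster_confidences([0, 1], [[m1], [m1, m2]])
```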


The general framework for decision combination is the following. Let
$\Lambda = \{\lambda_k^{BW}, \lambda_k^{MCE}, \lambda_k^{VB}\}$ be the resulting mixture model composed of a total of $K$ models,
$K = K_1 + K_2 + K_3$. Let

$F(k, r) = \log \Pr(O_r|\lambda_k), \quad 1 \le r \le R, \quad 1 \le k \le K$  (92)

be the log-likelihood matrix obtained by testing the $R$ training sequences with the $K$ models. Each
column $f_r$ of the matrix $F$ represents the feature vector of a sequence in the decision space
(recall that $f_r$ is a $K$-dimensional vector while $O_r$ is a sequence of vector observations of length $T$,
a 15 x 4 sequence in the case of the landmine application example). In fact, each column holds
the confidences assigned by the $K$ models to sequence $r$. Therefore, the set of sequences
$\mathcal{O} = \{O_r, y_r\}_{r=1}^{R}$ is mapped to a Euclidean confidence space $\{f_r, y_r\}_{r=1}^{R}$. Finally, a combination
function, $H$, takes all the $f_r$'s as input and outputs the final decision.
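
The mapping from sequences to the confidence space can be sketched as follows. This is a minimal illustration only: `loglik` is a hypothetical scoring function returning $\log \Pr(O|\lambda)$ for one model, and the three model lists stand in for the BW-, MCE-, and VB-trained cluster models; none of these names come from the original implementation.

```python
import numpy as np

def confidence_vector(seq, bw_models, mce_models, vb_models, loglik):
    """Return the K-dimensional vector of per-cluster log-likelihoods for one sequence.

    mce_models is a list of lists (one list of per-class models per cluster).
    loglik(model, seq) is a hypothetical function returning log Pr(seq | model).
    """
    f = [loglik(m, seq) for m in bw_models]
    # An MCE-trained cluster reports the best response among its class models.
    f += [max(loglik(m, seq) for m in cluster) for cluster in mce_models]
    f += [loglik(m, seq) for m in vb_models]
    return np.array(f)

def loglik_matrix(seqs, bw_models, mce_models, vb_models, loglik):
    """Stack the confidence vectors as columns, giving the K x R matrix F of (92)."""
    cols = [confidence_vector(s, bw_models, mce_models, vb_models, loglik) for s in seqs]
    return np.column_stack(cols)
```

Each column of the returned matrix is the feature vector that the combination function $H$ receives for one sequence.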

Several  decision  level  techniques  such  as  simple  algebraic  [63],  artificial  neural  networks 
(ANN)  [1],  and  hierarchical  mixture  of  experts  (HME)  [46]  can  be  used. 

3.3.4.1 Algebraic-based fusion

Simple algebraic methods are not trainable and rely on the statistics of the clusters' contents
and on the performance of the corresponding models. The assumption is that each cluster is
associated with one and only one class, while a class can be represented by more than one cluster.
In the case of MCE-trained cluster models, the corresponding class is the class in which the
observation has the maximum likelihood. Let class($k$) refer to the class label of the majority of the
samples assigned to cluster $k$. The aggregation function could be a simple majority vote, i.e.,

$H(f_i) = \arg\max_{c=1:C} \sum_{k=1}^{K} f_{ik}\, \delta(\mathrm{class}(k) = c), \quad 1 \le i \le R,$  (93)

or the max operator,

$H(f_i) = \mathrm{class}\left(\arg\max_{k=1:K} f_{ik}\right), \quad 1 \le i \le R.$  (94)

Since we adopted an unsupervised, non-trainable approach to data partitioning, it is not always
possible to identify reliable cluster-to-class associations and derive adequate statistics. An alternative
is to use ANNs and HMEs to automate these associations and learn the corresponding weights from
the performance of the models on the training data.
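
The two algebraic rules can be illustrated with a short sketch. It assumes that the confidences of one sequence are stored in a vector `f` and that `cluster_class[k]` holds the (0-based) class label of cluster $k$; both names are introduced here for illustration only.

```python
import numpy as np

def weighted_vote(f, cluster_class, n_classes):
    """Rule (93): sum the confidences of the clusters tied to each class, keep the largest."""
    scores = np.zeros(n_classes)
    for k, c in enumerate(cluster_class):
        scores[c] += f[k]
    return int(np.argmax(scores))

def max_rule(f, cluster_class):
    """Rule (94): return the class of the single most confident cluster."""
    return int(cluster_class[int(np.argmax(f))])
```

The two rules can disagree: a class backed by several moderately confident clusters may win the vote, while the max operator follows the single strongest cluster.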

3.3.4.2 ANN-based fusion

The basic decision combination neural network (DCNN) [1] is a single layer perceptron
with no hidden layers taking $\{f_1, \ldots, f_R\}$ as input and outputting the predicted labels $\{z_1, \ldots, z_R\}$.
Borrowing the framework introduced in [1], the proposed ANN architecture is illustrated in figure
23. The combination function for this network is


$h_c = \sum_{i=1}^{R} \sum_{k=1}^{K} w_{ikc} f_{ik} + \theta_c$  (95)

where $w_{ikc}$ and $\theta_c$ are respectively the weights and the per-class bias of the DCNN. The final output
is cast as a sigmoid function of the form


$H_c(f_1, \ldots, f_R) = \frac{1}{1 + e^{-h_c(f_1, \ldots, f_R)}}.$  (96)


The advantage of this architecture is that each weight has an interpretation for the role it plays
in the decision combination. Weight $w_{ikc}$ is the contribution to class $C_c$ when classifier $\lambda_k$ assigns
confidence $f_{ik}$ to data sample $i$. The weights are learned using a backpropagation [64] algorithm
that minimizes the mean square error (MSE) between the true labels and the actual outputs of the
DCNN on the training data.
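
As a rough illustration of this kind of fusion, the sketch below trains a single sigmoid unit by gradient descent on the MSE. It is a simplified two-class variant that uses one weight per cluster model and processes each sequence independently; the function names, learning rate, and epoch count are assumptions made for the example and are not taken from [1].

```python
import numpy as np

def train_dcnn(F, y, lr=0.5, epochs=2000, seed=0):
    """Train a single-layer sigmoid unit by gradient descent on the MSE.

    F: (R, K) confidences, one row per training sequence (assumed to be
    scaled to a comparable range); y: (R,) labels in {0, 1}.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=F.shape[1])
    b = 0.0
    for _ in range(epochs):
        z = 1.0 / (1.0 + np.exp(-(F @ w + b)))  # sigmoid output, as in (96)
        grad_h = (z - y) * z * (1.0 - z)         # gradient of the squared error w.r.t. h
        w -= lr * F.T @ grad_h / len(y)
        b -= lr * grad_h.mean()
    return w, b

def dcnn_predict(F, w, b):
    """Return the fused confidence in (0, 1) for each row of F."""
    return 1.0 / (1.0 + np.exp(-(F @ w + b)))
```

After training, the sign of each weight indicates whether a strong response of the corresponding cluster model pushes the decision toward class 1 or class 0.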

The  DCNN  framework  has  several  desirable  properties.  First,  it  can  be  considered  as  a 
generalization  of  the  linear  combination  models.  Second,  it  is  easily  trainable.  In  other  words, 
training  data  could  be  used  to  learn  the  weights.  Finally,  it  is  not  sensitive  to  possibly  correlated 
classifiers,  as  redundant  classifiers  could  be  eliminated  through  standard  ANN  pruning  techniques 
[65]. The main drawback of ANNs is that they are not easily interpretable and could suffer from
overfitting. 

3.3.4.3 HME-based fusion

As highlighted in section 2.3.3, the HME method utilizes a tree-based hierarchy of experts
and can be applied to various supervised learning problems. For the eHMM framework, models
learned from clusters of observations can be treated as experts in different domains. The leaves
of such a hierarchy represent experts, while the non-leaf nodes constitute a gating network designed to
dynamically combine the outputs of the subtrees. The input to the HME network is a $K$-dimensional
vector $F$. Equation (64) maps the input of the HME to the output, denoted $y$, via a probabilistic
model. In this case, $H$ is a probability density function, $P$, conditioned on the input $F$ and the
model parameters $\Theta$ of the HME:

$P(y|F, \Theta) = \sum_i g_i \sum_j g_{j|i} P_{ij}(y).$  (97)

In the case of binary classification, the output $y$ is a discrete random variable having two
possible classes (or two possible values: 0 or 1). In this case, equation (97) reduces to:

$P(y|F, \Theta) = \sum_i g_i \sum_j g_{j|i}\, \mu_{ij}^{y} (1 - \mu_{ij})^{1-y}.$  (98)

Here $\mu_{ij}$ is the conditional probability of classifying the input as class 1. Note that the dependence
on the input is through the gating networks $g_i$ and $g_{j|i}$.

The  HME  parameters  are  learned  from  data  via  an  EM-like  algorithm  [46]  or  via  gradient 
descent  techniques  [45]. 
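
To make the binary HME output concrete, the following sketch evaluates the mixture of (98) for one input vector. It assumes softmax gating networks and logistic experts that are linear in the input, which is the standard HME parameterization; the array names `V_top`, `V_low`, and `W` are hypothetical, and the parameters would in practice come from EM or gradient training.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def hme_prob(f, V_top, V_low, W, y=1):
    """P(y | f) for a two-level HME with a binary output, following (98).

    f: (K,) input; V_top: (I, K+1) top-level gate; V_low: (I, J, K+1) lower
    gates; W: (I, J, K+1) logistic experts. The last column of every
    parameter array multiplies a constant 1 (intercept).
    """
    x = np.append(f, 1.0)
    g_top = softmax(V_top @ x)                       # g_i
    p = 0.0
    for i in range(V_top.shape[0]):
        g_low = softmax(V_low[i] @ x)                # g_{j|i}
        mu = 1.0 / (1.0 + np.exp(-(W[i] @ x)))       # mu_ij
        p += g_top[i] * np.sum(g_low * mu**y * (1.0 - mu)**(1 - y))
    return float(p)
```

Because the gate outputs sum to one at every level, the values returned for `y=1` and `y=0` always add up to one.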

3.3.4.4 Illustrative example: fusion results using HME and ANN

In our illustrative example, we combine the cluster model outputs using two methods: ANN
and HME fusion. In the case of ANN combination, the weights reflect the cluster models'
performance and are learned from the models' responses to the training data. In figure 24, we report
the weights assigned to each cluster model after training the ANN. Typically, one should expect that
the weights assigned to clutter-dominated models are negative and those assigned to mine-dominated
clusters are positive.

Figure 24. Weights assigned to the 20 HMMs using ANN. (a) Distribution of the alarms in each
cluster per class (mines vs. clutter), (b) ANN weights and bias

In the case of HME combination, each model is considered an "expert" in a particular region
of the feature space. The gating networks of the HME control the regions in which a particular
expert model has better performance. The posterior probability at a particular node in the HME is
the likelihood output of the expert model. The gating network coefficients are learned by optimizing
the performance of the HME network on the training data using the EM algorithm. We use a 1-level
HME with a branching factor of two. We show the gating network parameters associated with each
model (expert network) in table 7. The gating network parameters, the $v_i$'s, can be viewed as a soft
partition of the original $K$-dimensional confidence space.

Given a test observation $x$, its inner product with the gating network vectors (shown in
table 7), combined with the softmax smoothing function, determines the degree to which the test
point is to be treated by each of the expert networks. Geometrically, vectors $v_0$, $v_{11}$, and $v_{12}$ can
be viewed as the normals to the regression surfaces of the gating networks. The higher the absolute
value of one covariate of $v$, the more weight is given to the corresponding model confidence. As
mentioned earlier, the parameters $v_0$, $v_{11}$, and $v_{12}$ are learned via the EM algorithm. In figure 25,
we show the evolution of the mean log-likelihood along the EM iterations. As can be seen, the
overall log-likelihood of the HME reaches a plateau around iteration 40.

To evaluate the HME and ANN fusion methods, we compare the ROCs generated by these
two methods, using the training data, to the ROCs of the best three cluster models. As can
be seen in figure 26, both fusion methods are consistently better than the three cluster models.
Also, we can see that the HME fusion slightly outperforms the ANN fusion. This is possibly due
to the fact that the HME is more suited for highly nonlinear classification problems. In this kind
TABLE 7

HME gating network parameters after EM training

Dimension     $v_0$        $v_{11}$    $v_{12}$
1             -170.222     -0.355      -0.015
2               48.345     -0.498       0.485
3             -479.007     -0.632      -2.609
4               13.281     -0.264       0.351
5               51.229      0.220      -1.485
6             -129.177      0.188      -1.712
7              102.388      0.564       1.402
8              180.394      1.243       1.212
9              249.114      0.882       2.778
10              29.622     -0.023       0.354
11             175.361     -1.182       3.203
12               4.915     -0.191      -1.015
13             -96.080      0.072      -0.644
14            -135.543     -1.182       2.518
15              57.147     -0.447       0.318
16             -55.869     -0.276      -0.421
17              43.716      0.209       0.504
18              -9.176      0.429      -2.111
19            -531.275     -0.888      -2.791
20              36.375      0.075       0.219
Intercept     -154.275      0.087      14.588



Figure  25.  Log-likelihood  of  the  HME  during  EM  training 

of settings (high nonlinearity), the ANN is prone to overfitting, while the HME addresses the high
nonlinearity through the "soft" partitioning of the input space, i.e., the divide-and-conquer approach.

Figure 26. Comparison of the ANN and HME fusion with the best three cluster models (1, 2, and
12).


3.4  Chapter  summary 

In this chapter, we proposed our ensemble HMM classifier and its components. First, we gave
the motivations behind adopting our approach. Then, we detailed the different components/steps
of the proposed ensemble HMM classifier, namely, the similarity matrix computation, the
pairwise-similarity-based clustering, the model initialization and training, and the decision level fusion. We
have used an example of landmine detection using GPR signatures and EHD features to illustrate
and motivate each step. In the next chapters, we provide more details about the application of the
ensemble HMM classifier to real-world sequential data sets. In particular, we analyze, in chapter
4, the results of applying the eHMM to identify CPR scenes in video simulating medical crises.
Then, in chapter 5, we provide more comprehensive results and analysis of the application of
eHMM to landmine detection.

CHAPTER 4


APPLICATION  TO  CPR  SCENES  IDENTIFICATION 

In  this  chapter,  the  proposed  ensemble  HMM  classifier  is  used  to  identify  CPR  scenes  in  video 
simulating  medical  crises.  First,  we  provide  the  steps  needed  for  extracting  motion-based  features 
from  the  training  videos.  Then,  we  outline  the  proposed  eHMM  classifier  architecture.  Finally,  we 
compare  the  eHMM  classifier’s  performance  to  the  baseline  HMM  classifier. 

4.1  Introduction 

Medical simulations, where uncommon clinical situations can be replicated, have proved to
provide more comprehensive training. Simulations involve the use of patient simulators, which are
lifelike mannequins that breathe, have a heartbeat, and respond to treatment with virtual drugs.
Simulation sessions involve 4 to 9 people and last approximately 30 minutes to one hour. Simulation
training sessions are scheduled approximately twice per week and are recorded as video data. After
each session, the physician/instructor must manually review and annotate the recording and then
debrief the trainees on the session. Video-assisted debriefing allows participants to deconstruct and
reflect on their experiences, teaching them how to approach such tasks more effectively in the future.

The physician responsible for the simulation sessions has recorded over 100 sessions, and is
now realizing that: (1) the manual process of review and annotation is labor intensive; (2) retrieval
of specific video segments is not trivial; and (3) there is a wealth of information waiting to be mined
from these recordings. Providing the physician with automated tools to segment, semantically
index, and retrieve specific scenes from a large database of training sessions will enable him/her to:
(1) immediately review important sections of the training with the team; (2) conduct a more efficient
debriefing session with the team of trainees; and (3) identify similar circumstances in previously
recorded sessions. The longer term payback is the potential discovery of similar critical elements in a
training session that result in either positive or negative outcomes and thus enhance the effectiveness
of the training.




4.2  Identification  of  CPR  scenes  in  video  simulating  medical  crises 


In this section, we outline how we use the eHMM to detect and classify
scenes that involve CPR activity. Our approach consists of two main steps. The first one segments
the video into shots, selects one keyframe for each shot, and identifies regions with skin-like colors
in each keyframe. Each skin region is then represented by a sequence of observations that encode its
motion in the different frames within the shot boundaries. The second step consists of building an
ensemble HMM classifier that uses motion-based features in order to identify the skin-like regions
that involve CPR activities.

4.2.1  Video  and  image  segmentation 

First, the video must be split into a set of shots by detecting shot boundaries. A shot is
defined as a video segment within a continuous capture period. To include the effect of dynamic visual
content, various cues such as color, motion, and mosaics of frames can be adopted. The most common
approaches to shot boundary detection are based on color histograms [66], probability theory [67],
and unsupervised clustering [68]. Other approaches, based on motion, are generally better suited
for controlling the number of frames based on temporal dynamics in the scene. These approaches
include pixel-based image differences [69], optical flow computation [70], and global motion and
gesture analysis [71].

After shot boundary detection, a key-frame that reflects the main content of each scene
needs to be extracted. Most of the early work selects key-frames by randomly or uniformly sampling
the video frames from the original sequence at certain time intervals [72]. Since a shot is defined
as a video segment within a continuous capture period, a natural and straightforward way of
key-frame extraction is to use the first frame of each shot as its key-frame [73, 70]. In our proposed
approach, we adopt the latter method. Next, each key-frame is segmented into a small number of
homogeneous regions. These regions will provide an organization framework for subsequent scene
analysis. Segmentation is achieved by clustering the color pixel information into non-overlapping
regions of similar features. Any clustering algorithm could be used to achieve this task. We use
the normalized cut algorithm (Ncut) [2]. Our choice is motivated by the computational efficiency
of this algorithm and its ability to cluster the image into a reasonable number of regions with no
supervision information. Figure 27(a) displays a key-frame from a sample shot extracted from one
of the training sessions. Figure 27(b) displays the segmented regions obtained by using Ncut.
(a) Key-frame of a sample shot  (b) Segmented key-frame

Figure  27.  Key-frame  segmentation  using  Ncut  [2]  algorithm 


4.2.2  Skin  detection 

Instead of processing all detected regions, we identify only those that are of interest. Since
our objective is to identify CPR scenes and since this action typically involves the trainees'
hands and the mannequin's chest, we identify and keep only those regions with skin-like colors. To
achieve this task, we use a simple but efficient skin pixel classifier [74] to discriminate between skin
and non-skin regions. We use a collection of skin images to train the classifier. This collection
includes 500 images of hands at different orientations and under different illumination. First, each
image is mapped to the $(r, g)$ color space where $r = R/(R+G+B)$ and $g = G/(R+G+B)$. This normalized
color space has proved to be suitable for representing skin color under different lighting conditions.
Each image is then filtered using a low-pass filter to reduce the effect of noise. Second, the color
distribution of all pixels in all images is fitted by a Gaussian model with mean $\mu$ and covariance
matrix $\Sigma$.

Let $x_{ij} = (r_{ij}, g_{ij})$ be the color of the $i$th pixel in region $R_j$. The likelihood of $x_{ij}$ in the
Gaussian skin model can be estimated using

$\mathcal{L}(x_{ij}) = \exp\left(-0.5\,(x_{ij} - \mu)^T \Sigma^{-1} (x_{ij} - \mu)\right)$  (99)

A region is labeled as a skin region, and retained for further processing, if many of its pixels
have likelihood larger than a given threshold $\theta$.
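
The skin model can be sketched in a few lines. This is an illustrative reading only: the low-pass filtering step is omitted, the function names are hypothetical, and the rule "many of its pixels" is interpreted as a minimum fraction `min_fraction` of the region's pixels, which is an assumption of this example.

```python
import numpy as np

def to_rg(rgb):
    """Map (..., 3) RGB values to normalized (r, g) chromaticity."""
    rgb = np.asarray(rgb, dtype=float)
    s = rgb.sum(axis=-1, keepdims=True)
    s[s == 0] = 1.0                        # avoid dividing by zero on black pixels
    return (rgb / s)[..., :2]

def fit_skin_model(skin_rgb):
    """Fit the Gaussian skin model (mean and covariance) to training skin pixels."""
    x = to_rg(skin_rgb).reshape(-1, 2)
    return x.mean(axis=0), np.cov(x, rowvar=False)

def skin_likelihood(rgb, mu, cov):
    """Unnormalized Gaussian likelihood of (99), one value per pixel."""
    d = to_rg(rgb) - mu
    return np.exp(-0.5 * np.einsum('...i,ij,...j->...', d, np.linalg.inv(cov), d))

def is_skin_region(region_rgb, mu, cov, theta=0.5, min_fraction=0.5):
    """Label a region as skin if enough of its pixels exceed the threshold theta."""
    return bool(np.mean(skin_likelihood(region_rgb, mu, cov) > theta) >= min_fraction)
```

Working in the $(r, g)$ space removes most of the brightness variation, so a single Gaussian can cover hands seen under different illumination.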

Figure  28  illustrates  the  skin  detection  process.  In  figure  28(a),  we  display  the  likelihood  of 
all  pixels  of  figure  27(a)  in  the  Gaussian  skin  model.  Here,  red  pixels  indicate  high  likelihood  while 
blue  pixels  indicate  zero  likelihood. 
(a) Skin likelihood  (b) Identified skin regions

Figure  28.  Skin  detection  using  a  Gaussian  skin  model 


(a)  One  of  the  skin  regions  (b)  Optical  flow 

Figure  29.  Optical  flow  of  one  of  the  identified  regions  in  figure  28(b) 


4.2.3  Motion  features  extraction 

Motion is an effective feature for analyzing moving objects that change location with time [75].
For each skin region, we extract a motion feature vector. First, we compute the optical flow for all
pixels within the region using the Horn-Schunck algorithm [76]. Optical flow measures the apparent
motion, in terms of speed and direction, at each pixel location. Then, the optical flow, $u_j = (u_x, u_y)$,
of the region of interest $R_j$ is estimated as the average of the optical flow of all of its pixels. Figure
29(a) displays one of the skin-labeled regions in figure 28(b). In figure 29(b), we superimpose the
optical flow of each pixel in the region as well as the average optical flow $u_j$ (in red). In this case, the
pixels are moving downward, indicating the push-down phase of the CPR.

(a) Average optical flow of a typical CPR sequence  (b) Average optical flow of a non-CPR sequence

Figure 30. Average optical flow sequence of two sample regions

The average optical flow $u_j$ is computed at each frame of a given shot. Thus, skin region $R_j$
within the $k$th video shot will be represented by the sequence of observations:

$O_{jk} = \{u_{jk}^{1}, \ldots, u_{jk}^{t}, \ldots, u_{jk}^{T}\},$  (100)

where $T$ is the number of consecutive frames, within video shot $k$, that include skin region $R_j$.
Figure 30 displays the average optical flow of one skin region that involves CPR activities and
another non-CPR skin region over a sequence of 50 frames. As can be seen, the CPR sequence
can be characterized by an upward motion followed by a downward motion. On the other hand, the
non-CPR sequence has no specific pattern and the motion vectors tend to be random.
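
Building the observation sequence of (100) amounts to averaging the flow field inside the region at every frame. The sketch below assumes the per-pixel optical flow has already been computed (for example by a Horn-Schunck implementation, not shown) and that the region mask stays fixed within the shot; both are simplifying assumptions, and the function name is hypothetical.

```python
import numpy as np

def region_flow_sequence(flows, mask):
    """Average the per-pixel optical flow inside a region, frame by frame.

    flows: (T, H, W, 2) array holding (u_x, u_y) for every pixel of T frames.
    mask:  (H, W) boolean array marking the pixels of the skin region.
    Returns a (T, 2) array, i.e. one averaged motion vector per frame.
    """
    flows = np.asarray(flows, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    return flows[:, mask, :].mean(axis=1)
```

Each row of the result is one observation $u_{jk}^{t}$, so the returned array can be passed directly to an HMM that expects two-dimensional observations.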


4.2.4  Video  shot  classification  using  HMMs 
4.2.4.1 Baseline HMM classifier

To discriminate between skin regions that are involved in CPR activities and other skin
regions, we developed and trained a Hidden Markov Model (HMM) classifier that is based on motion
features. The baseline HMM classifier (bHMM) is composed of two models: one for CPR sequences
and one for non-CPR sequences. Each model has two states and produces a probability value
by backtracking through the model states using the Viterbi algorithm [8]. The CPR model, $\lambda^{CPR}$,
is designed to capture the smooth transition between state 1 and state 2 that characterizes CPR
sequences. The non-CPR model, $\lambda^{\sim CPR}$, is needed to capture the characteristics of non-CPR
sequences. The probability value produced by the CPR (non-CPR) model can be thought of as
an estimate of the probability of the observation sequence given that there is a CPR (non-CPR)
model present. The confidence value assigned to each observation sequence, $Conf(O)$,
depends on: (1) the probability assigned by the CPR model $\lambda^{CPR}$, and (2) the probability assigned
by the non-CPR model $\lambda^{\sim CPR}$. In particular, we use:

$Conf(O) = \log \Pr(O|\lambda^{CPR}) - \log \Pr(O|\lambda^{\sim CPR}).$  (101)
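
The confidence of (101) can be illustrated for the discrete case with a log-domain Viterbi recursion. This is a minimal sketch under stated assumptions: observations are already quantized to codeword indices, each model is given as a hypothetical tuple `(pi, A, B)`, and only the score of the best path is returned (the path itself is not stored).

```python
import numpy as np

def viterbi_logprob(obs, pi, A, B):
    """Log-probability of the best state path of a discrete HMM.

    obs: sequence of symbol indices; pi: (N,) priors; A: (N, N) transition
    matrix; B: (N, M) emission probabilities (rows are states).
    """
    with np.errstate(divide="ignore"):     # log(0) = -inf is acceptable here
        lpi, lA, lB = np.log(pi), np.log(A), np.log(B)
    delta = lpi + lB[:, obs[0]]
    for o in obs[1:]:
        # Best score of reaching each state, then emit the current symbol.
        delta = np.max(delta[:, None] + lA, axis=0) + lB[:, o]
    return float(np.max(delta))

def confidence(obs, cpr_model, non_cpr_model):
    """Confidence (101): difference between the two models' log-probabilities."""
    return viterbi_logprob(obs, *cpr_model) - viterbi_logprob(obs, *non_cpr_model)
```

A positive value means the CPR model explains the sequence better than the non-CPR model, and thresholding this value yields the operating points of an ROC curve.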

4.2.4.2 Ensemble HMM classifier

In the baseline HMM classifier, the CPR scenes are modeled by a single HMM. The goal is to
generalize from all the training data in order to classify unseen scenes. However, generalizing from
all the motion vectors might lead to too much averaging and thus to losing some of the discriminating
characteristics of the CPR scenes. An illustrative example of this drawback was given in section
3.2. In order to overcome this limitation, the two-model classifier is replaced by our ensemble HMM
(eHMM). Using multiple models can capture the variations of the CPR scenes that may be lost
under the effect of averaging in the two-model case. The implementation of the eHMM for CPR
classification is detailed in section 4.3.4.

4.3  Experimental  results 

4.3.1  Data  collection  and  preprocessing 

To validate our proposed approach to detect CPR shots, we use the video of one simulation
session. The duration of the session is 30 min and 19 sec, with 29 frames per sec. Each frame has
720 x 480 resolution. After shot boundary detection, key-frame segmentation, and skin detection,
we obtain a total of $N = 122$ skin regions. We track each region, $R_j$, over all frames within its shot
($k$) boundary, and extract the sequence of observations $O_{jk}$ (as defined in (100)). For validation
purposes, we examine each sequence and assign the ground truth labels (CPR or non-CPR). We
obtain a total of 42 sequences labeled positively as representing CPR activity, and 80 non-CPR
sequences. We also fix the sequence length $T$ to 20 even though HMMs can handle sequences of
variable length. This is a reasonable assumption as 20 frames tend to cover the upward and
downward motion of the hands in most CPR sequences. Moreover, HMM classifiers are simpler and
more efficient when all sequences have a constant length.

For our experiments, we use 4-fold cross validation. For each fold, a subset of the data is
used for training ($\mathcal{D}_{Trn}$) and the remaining data is used for testing ($\mathcal{D}_{Tst}$). Let $R_n$ be the size of
$\mathcal{D}_{Trn}$ and $Q_n = N - R_n$ be the size of $\mathcal{D}_{Tst}$ for the $n$th fold. For training, we denote by $\mathcal{D}_{Trn}^{CPR}$ and
$\mathcal{D}_{Trn}^{\sim CPR}$ the subsets of the CPR sequences and the non-CPR sequences, respectively.

4.3.2  Evaluation:  Receiver  Operating  Characteristic  (ROC)  curve 

The ROC curve is a graphical plot of the sensitivity vs. (1-specificity) for a binary classifier
system as its discrimination threshold is varied. Equivalently, the ROC can be represented by
plotting the fraction of true positives (TPR = true positive rate) vs. the fraction of false positives
(FPR = false positive rate). Consider a two-class prediction problem (binary classification), in
which the outcomes are labeled as either the positive (p) or the negative (n) class. There are four
possible outcomes from a binary classifier. If the outcome from a prediction is p and the actual value
is also p, then it is called a true positive (TP); however, if the actual value is n, then it is called a
false positive (FP). Conversely, a true negative occurs when both the prediction outcome and the
actual value are n, and a false negative occurs when the prediction outcome is n while the actual
value is p. Let us define an experiment with P positive instances and N negative instances. The four
outcomes can be formulated in a 2 x 2 contingency table, or confusion matrix (refer to table 8). To
draw an ROC curve, only the true positive rate (TPR) and the false positive rate (FPR) are needed.
TPR is the fraction of the positive samples available during the test that the classifier (or diagnostic
test) labels correctly. FPR, on the other hand, is the fraction of the negative samples available
during the test that are incorrectly labeled as positive.

An ROC space is defined by FPR and TPR as the x and y axes, respectively, and depicts the relative
trade-offs between true positives (benefits) and false positives (costs). Since TPR is equivalent to
sensitivity and FPR is equal to 1-specificity, the ROC graph is sometimes called the sensitivity vs.
(1-specificity) plot. Each prediction result, i.e., one instance of a confusion matrix, represents one point
on the ROC curve.
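
The construction of the curve can be sketched as follows, assuming each test sequence has a scalar confidence and a binary ground truth label. The function name is hypothetical, and thresholds are taken at every distinct confidence value.

```python
import numpy as np

def roc_points(scores, labels):
    """Return (FPR, TPR) arrays obtained by thresholding at every distinct score."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    P, N = labels.sum(), (~labels).sum()
    fpr, tpr = [0.0], [0.0]                # threshold above every score
    for t in np.sort(np.unique(scores))[::-1]:
        pred = scores >= t                 # predicted positive at this threshold
        tpr.append((pred & labels).sum() / P)
        fpr.append((pred & ~labels).sum() / N)
    return np.array(fpr), np.array(tpr)
```

Each threshold yields one confusion matrix and therefore one point of the curve; lowering the threshold moves the point toward the upper right corner of the ROC space.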


TABLE 8
Contingency table

          p                 n                 total
p'        True Positive     False Positive    P'
n'        False Negative    True Negative     N'
total     P                 N                 P+N
4.3.3 Video shot classification using baseline HMM


For each cross validation fold, we build a two-model classifier: one HMM model, $\lambda^{CPR}$,
based on the subset $\mathcal{D}_{Trn}^{CPR}$ of CPR training sequences; and another model, $\lambda^{\sim CPR}$, based on the
non-CPR training sequences $\mathcal{D}_{Trn}^{\sim CPR}$. We assume that $\lambda^{CPR}$ has two states $s_1$ and $s_2$: one represents
the upward motion of the hands and the second represents the downward motion. These states are
estimated using:

$\begin{cases} s_1 = \mathrm{average}\{u_j \in \mathcal{D}_{Trn}^{CPR} \mid u_y > 0\} \\ s_2 = \mathrm{average}\{u_j \in \mathcal{D}_{Trn}^{CPR} \mid u_y < 0\} \end{cases}$  (102)

The non-CPR HMM model, $\lambda^{\sim CPR}$, also has two states. Since this model follows no specific
pattern, we let its states be the two centers obtained by clustering $\mathcal{D}_{Trn}^{\sim CPR}$ using the Fuzzy C-Means
(FCM) algorithm [57] with $C = 2$.
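
The state estimates of (102) for the CPR model reduce to two conditional averages, as the following sketch shows. It assumes each training sequence is stored as an array of $(u_x, u_y)$ rows; the function name is hypothetical, and the FCM step used for the non-CPR model is not shown.

```python
import numpy as np

def init_cpr_states(train_seqs):
    """Estimate the two CPR state means as in (102).

    train_seqs: iterable of (T, 2) arrays of (u_x, u_y) observations.
    Returns (s1, s2): the mean of all observations with positive u_y and
    the mean of all observations with negative u_y.
    """
    obs = np.vstack(train_seqs)
    s1 = obs[obs[:, 1] > 0].mean(axis=0)
    s2 = obs[obs[:, 1] < 0].mean(axis=0)
    return s1, s2
```

Observations with $u_y$ exactly equal to zero contribute to neither state in this sketch.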

The remaining parameters $(A, B, \pi)$ of $\lambda^{CPR}$ and $\lambda^{\sim CPR}$ are initialized as follows. The
priors, $\pi_1$ and $\pi_2$, are estimated as the percentage of training sequences that start with state $s_1$ and
$s_2$, respectively. The transition matrix is initialized using:

$A = \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{bmatrix}$

The initialization of the emission probabilities for each state depends on whether we use a discrete or
a continuous HMM model.


•  For the discrete case, the emission matrix, $B$, is initialized based on the states and a codebook
of size $M = 20$. First, codewords $\{v_m, m = 1, \ldots, 20\}$ are estimated as the centers obtained
by clustering the training data using the FCM algorithm with $C = 20$. For the CPR model,
we eliminate codewords that have low magnitude ($|u_y| < 0.5$). This will prevent observations
with small motion vectors (mainly from non-CPR sequences) from having high probabilities
in those codewords. Then, the emission probabilities are initialized using

$b_j(m) = \left[ \sum_{i=1}^{2} \frac{\|v_m - s_j\|}{\|v_m - s_i\|} \right]^{-1}$, for $j = 1, 2$ and $m = 1, \ldots, 20$.  (103)

Finally, the $b_j(m)$ are normalized such that the probabilities in each state sum to one.

•  For the continuous case, the emission probabilities are modeled by mixtures of Gaussians with
three components for each state. For each state, we cluster the observations belonging to that
state using the FCM algorithm with $C = 3$. The means of the Gaussian mixture components of
each state are the centers of the clusters. The covariance of each state component is estimated
using the observations that belong to the corresponding cluster. For computational reasons,
only the diagonal elements of the covariance are considered and all covariance values are
lower-bounded by a threshold $MinVar = 0.1$.
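
For the discrete case, the initialization can be sketched as below. The sketch assumes that each codeword is weighted by the inverse of its Euclidean distance to each state mean, which is our reading of the distance-based initialization described above, followed by the per-state normalization; the function name and the small constant `eps` are introduced only for this example.

```python
import numpy as np

def init_discrete_emissions(codewords, state_means, eps=1e-12):
    """Distance-based initialization of the emission matrix B.

    codewords: (M, 2) array; state_means: (N, 2) array.
    Returns B of shape (N, M) whose rows (states) each sum to one.
    """
    codewords = np.asarray(codewords, dtype=float)
    state_means = np.asarray(state_means, dtype=float)
    # d[j, m] = distance between state mean j and codeword m
    d = np.linalg.norm(codewords[None, :, :] - state_means[:, None, :], axis=2)
    inv = 1.0 / (d + eps)                       # eps guards against a zero distance
    B = inv / inv.sum(axis=0, keepdims=True)    # share of each state for a codeword
    return B / B.sum(axis=1, keepdims=True)     # probabilities in each state sum to one
```

With this scheme, a codeword lying close to a state mean receives most of its probability mass in that state, regardless of the sign of its $u_y$ component.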

After initialization, the transition probabilities and the emission probability parameters of
each model are fine-tuned using either the discrete or continuous version of the Baum-Welch learning
algorithm [9] and the respective training data. We denote by $\lambda^{CPR}_{DHMM}$ the resulting discrete
BW-trained CPR model, and we use analogous notation for the continuous CPR model and for the
discrete and continuous non-CPR models.

4.3.3.1 Baseline discrete HMM results


TABLE 9

Baseline DHMM model parameters of $\lambda^{CPR}_{DHMM}$

          Means              Initial A          A
          $u_x$     $u_y$    $s_1$    $s_2$     $s_1$    $s_2$
$s_1$      0.16      0.64    0.5      0.5       0.85     0.15
$s_2$     -0.11     -0.48    0.5      0.5       0.13     0.87

          Codes              Initial B          B
          $u_x$     $u_y$    $s_1$    $s_2$     $s_1$    $s_2$
$v_1$     -5.47     -1.65    0.00     0.00      0.00     0.00
$v_2$      2.63      1.77    0.00     0.00      0.00     0.01
$v_3$      0.54      1.75    0.00     0.00      0.05     0.00
$v_4$      0.28      1.09    0.41     0.00      0.16     0.00
$v_5$     -0.22     -0.96    0.00     1.00      0.09     0.88
$v_6$     -0.65     -1.21    0.00     0.00      0.00     0.11
$v_7$     -0.23      1.36    0.00     0.00      0.06     0.00
$v_8$     -0.21      0.82    0.59     0.00      0.57     0.00
$v_9$      1.03      1.21    0.00     0.00      0.06     0.00

Table 9 displays the parameters of the baseline DHMM model $\lambda^{CPR}_{DHMM}$. In particular, the state
means, state transition probabilities, the codewords, and their emission probabilities in each state
are shown for both the initial and the Baum-Welch trained models. Since codewords that have
low magnitude ($|u_y| < 0.5$) were discarded, $\lambda^{CPR}_{DHMM}$ has only nine codewords out of the initial 20
obtained by the FCM clustering. Precisely, codewords $\{v_2, v_3, v_4, v_7, v_8, v_9\}$ have a positive $u_y$
component, and $\{v_1, v_5, v_6\}$ have a negative $u_y$ component. The corresponding probability
of emission of each codeword does not depend on $u_y$ but rather on the distance of the codeword to
the mean of each state. Moreover, the normalization step of the $B$ matrix rows allows for codewords
with negative (resp. positive) $u_y$ to have higher probability in state 1 (resp. state 2). This can be
illustrated better by visualizing the $\lambda^{CPR}_{DHMM}$ parameters in the $(u_x, u_y)$ plane (refer to figure 31).

Figure 31. Training observations, states representatives, and codewords of $\lambda^{CPR}_{DHMM}$

Figure 31 displays the observation vectors used to train the CPR model (first cross validation set). In this figure, we also display the states means and the codewords obtained by clustering the observations. As mentioned earlier, we notice that the codeword $v_1$ (resp. $v_2$) has higher probability in state 1 (resp. state 2) even though its $u_y$ value is negative (resp. positive). However, both $v_1$ and $v_2$ have low probabilities in both states (at most 0.01 as shown in table 9). Thus, $v_1$ and $v_2$ will not contribute significantly when testing a new sequence with $\lambda^{CPR}_{DHMM}$.

In table 10, we display the initial and the Baum-Welch trained $\lambda^{\overline{CPR}}_{DHMM}$ parameters. We notice that the diagonal elements of the transition matrix A are higher than those of the CPR model of table 9. This means that non-CPR sequences are more likely to stay in one of the states, unlike the CPR sequences that reflect an alternation of upward/downward motion along the $u_y$ direction. We also notice that all codewords of $\lambda^{\overline{CPR}}_{DHMM}$ with high magnitude ($|u_y| > 1$) have a low emission probability (at most 0.01) in both states (e.g. $v_1, v_2, v_3, v_8, \dots$).

Figure 32 illustrates the testing process of a sequence by $\lambda^{CPR}_{DHMM}$. In Figure 32(a), we display the 20 observations of a typical CPR sequence. As can be seen, this sequence corresponds to a region that starts with an upward motion (3 observations), followed by a downward motion (9 observations), and then back to an upward motion (8 observations). In Figure 32(b), we display (in green dots) the same sequence in the $(u_x, u_y)$ plane. As can be seen, the first and last observations (upward motion) will be assigned to the codes with higher probability in state 1 while the middle observations (downward motion) will be assigned to the codes with higher probability in state 2. The optimal state sequence, assigned by the Viterbi algorithm, to this test sample is

11122222222211111111

Figure 32. Illustration of testing a sequence with $\lambda^{CPR}_{DHMM}$: (a) 20 observation vectors of a typical CPR sequence; (b) path assigned by the Viterbi algorithm for the test sequence in (a)
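
For readers who want to reproduce this kind of state decoding, the following is a minimal sketch of the Viterbi recursion for a discrete HMM, written in the log domain. The transition matrix is the trained A of table 9; the two-symbol emission matrix `B` and the toy observation sequence are hypothetical and were chosen only so that the example is self-contained (symbol 0 stands for an "upward" code, symbol 1 for a "downward" code). States are 0-indexed in the code, so state 0 corresponds to state 1 in the text.

```python
import numpy as np


def viterbi(obs, pi, A, B):
    """Most likely state path of a discrete HMM (log-domain Viterbi)."""
    pi, A, B = (np.asarray(x, dtype=float) for x in (pi, A, B))
    with np.errstate(divide="ignore"):  # log(0) = -inf is intended
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    T, N = len(obs), len(pi)
    delta = np.empty((T, N))           # best log-score ending in each state
    back = np.zeros((T, N), dtype=int) # best predecessor of each state
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # scores[i, j]: move from i to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):      # backtrack from the last observation
        path.append(int(back[t][path[-1]]))
    return path[::-1]


pi = [1.0, 0.0]                        # CPR prior used in the text
A = [[0.85, 0.15], [0.13, 0.87]]       # trained A of table 9
B = [[0.9, 0.1], [0.1, 0.9]]           # hypothetical 2-symbol emission matrix
obs = [0, 0, 0, 1, 1, 1, 0, 0]         # hypothetical quantized observations
path = viterbi(obs, pi, A, B)
print(path)
```

With these toy parameters the decoded path follows the up/down/up pattern of the observations, which mirrors the 1-2-1 structure of the path reported above.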

To  evaluate  the  proposed  approach,  we  score  it  in  terms  of  probability  of  detection  (PD) 
versus  probability  of  false  alarms  (PFA).  Confidence  values  are  thresholded  at  different  levels  to 
produce  the  receiver  operating  characteristics  (ROC)  curve. 
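
The thresholding procedure can be sketched as below. The function name `roc_points` and the four confidence values are hypothetical; the sketch only assumes that a sequence is declared a detection when its confidence is at least the threshold, which the text does not specify.

```python
import numpy as np


def roc_points(confidences, labels):
    """Return (PFA, PD) pairs obtained by thresholding the confidences.

    labels: 1 for target (CPR) sequences, 0 for non-target sequences.
    A sequence is called a detection when confidence >= threshold (assumption).
    """
    conf = np.asarray(confidences, dtype=float)
    lab = np.asarray(labels, dtype=bool)
    thresholds = np.sort(np.unique(conf))[::-1]  # from strictest to loosest
    pd = np.array([(conf[lab] >= th).mean() for th in thresholds])
    pfa = np.array([(conf[~lab] >= th).mean() for th in thresholds])
    return pfa, pd


# Hypothetical confidences for two CPR and two non-CPR sequences.
pfa, pd = roc_points([0.9, 0.7, 0.6, 0.2], [1, 0, 1, 0])
print(list(zip(pfa, pd)))
```

Each threshold contributes one (PFA, PD) point; connecting the points from the strictest to the loosest threshold traces the ROC curve.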

Figure 33 compares the ROC curves generated by the trained $\lambda^{CPR}_{DHMM}$ and $\lambda^{\overline{CPR}}_{DHMM}$ models and by the baseline DHMM classifier that combines them (in solid lines) with those of the corresponding initial models before training (in dashed lines). As can be seen, the CPR model outperforms the non-CPR model as the former uses prior knowledge to initialize and restrict the model parameters. We also notice that the BW-training significantly improves the classification performance of the CPR model and the baseline DHMM




TABLE 10

Baseline DHMM model parameters of $\lambda^{\overline{CPR}}_{DHMM}$

         Means             Initial A        A
State    u_x      u_y      s1      s2       s1      s2
s1        0.04     0.24    0.5     0.5      0.87    0.13
s2       -0.02    -0.18    0.5     0.5      0.08    0.92

         Codes             Initial B        B
Code     u_x      u_y      s1      s2       s1      s2
v1       -0.23     2.47    0.00    0.00     0.01    0.00
v2       -3.19     1.16    0.00    0.00     0.01    0.00
v3       -1.19    -3.61    0.00    0.00     0.00    0.00
v4        0.56     0.28    0.00    0.00     0.15    0.00
v5       -0.32    -0.20    0.03    0.27     0.00    0.11
v6       -1.78    -1.03    0.00    0.00     0.01    0.00
v7       -0.94     0.42    0.00    0.00     0.06    0.00
v8       -6.65    -4.78    0.00    0.00     0.00    0.00
v9        0.00    -0.02    0.15    0.25     0.05    0.50
v10      -1.57     0.21    0.00    0.00     0.03    0.00
v11      -0.32    -1.04    0.00    0.00     0.00    0.03
v12       0.09     0.03    0.25    0.18     0.30    0.12
v13      -3.27    -4.78    0.00    0.00     0.00    0.00
v14       2.03     0.71    0.00    0.00     0.01    0.00
v15       0.41    -0.25    0.00    0.14     0.00    0.11
v16      -5.52     4.88    0.00    0.00     0.00    0.00
v17       0.08     9.62    0.00    0.00     0.00    0.00
v18      -0.08     0.03    0.27    0.17     0.26    0.13
v19     -13.32    11.69    0.00    0.00     0.00    0.00
v20      -0.19     0.59    0.29    0.00     0.11    0.00

classifier  (blue  and  red  ROC  curves  of  figure  33). 

Figure 36 compares the ROC curves generated by the trained $\lambda^{CPR}_{CHMM}$ and $\lambda^{\overline{CPR}}_{CHMM}$ models and by the classifier that combines them (in solid lines) with those of the corresponding initial models before training (in dashed lines). We notice that the BW-training significantly improves the classification performance for the non-CPR model (blue ROC curves of figure 36). However, the BW-trained CPR model slightly underperforms compared to the non-trained CPR model in terms of the area under the ROC curve. Nonetheless, the BW-trained baseline CHMM classifier that combines the CPR and non-CPR models has the largest area under the ROC curve.


4.3.3.2 Baseline continuous HMM results

Tables 11 and 12 display the initial and the Baum-Welch trained parameters of the $\lambda^{CPR}_{CHMM}$ and $\lambda^{\overline{CPR}}_{CHMM}$ models, respectively. In these tables, we notice that, for both $\lambda^{CPR}_{CHMM}$ and $\lambda^{\overline{CPR}}_{CHMM}$, the BW-training updates resulted in one dominant component of the Gaussian mixture (e.g. component 3 of state 1 and component 2 of state 2 of $\lambda^{CPR}_{CHMM}$). For both models, the dominant component in each state is the one with the lowest magnitude $|u|$. However, for $\lambda^{CPR}_{CHMM}$, the dominant component is the one that has the highest covariance along $u_y$, while, for $\lambda^{\overline{CPR}}_{CHMM}$, the dominant component is the one that has the lowest covariance along $u_y$. Converging to a single dominant component, especially for $\lambda^{\overline{CPR}}_{CHMM}$ where the weight of component 3 of state 1 is 0.92 (refer to table 12), suggests that the state 1 Gaussian mixture of $\lambda^{\overline{CPR}}_{CHMM}$ could have been modeled by fewer components.

Figure 33. ROCs generated by the BW-trained (solid lines) and initial (dashed lines) DHMM models


TABLE 11

Baseline CHMM model parameters of $\lambda^{CPR}_{CHMM}$

Initial model

A        s1      s2      Priors
s1       0.5     0.5     1.00
s2       0.5     0.5     0.00

                     B means           Covariance        Coefficient
State   Component    u_x      u_y      u_x      u_y      w
s1      c11          -0.16     1.09    0.10     0.13     0.33
s1      c12           0.57     1.11    0.20     0.21     0.33
s1      c13           0.04     0.16    0.14     0.10     0.33
s2      c21          -0.32    -0.76    0.16     0.10     0.33
s2      c22           0.01    -0.21    0.10     0.10     0.33
s2      c23           2.33    -0.56    0.33     0.20     0.33

Trained model

A        s1      s2      Priors
s1       0.66    0.34    1.00
s2       0.23    0.77    0.00

                     B means           Covariance        Coefficient
State   Component    u_x      u_y      u_x      u_y      w
s1      c11           0.23     1.09    0.32     0.23     0.18
s1      c12           0.21     0.93    0.29     0.31     0.13
s1      c13           0.09     0.31    0.20     0.35     0.69
s2      c21          -0.20    -0.67    0.21     0.13     0.21
s2      c22          -0.07    -0.35    0.18     0.17     0.60
s2      c23          -0.10    -0.42    0.18     0.17     0.19

Figure 34 displays the observation vectors used to train the CPR model (first cross validation set). In this figure, we show the $\lambda^{CPR}_{CHMM}$ model parameters for both the initial model (a) and the BW-trained model (b). In particular, we display the means of the states components where $c_{ij}$ denotes the center of component j of state i, for $i \in \{1,2\}$ and $j \in \{1,2,3\}$. The corresponding covariance is represented by an ellipse with dimensions proportional to the covariance in each direction $u_x$ and $u_y$; state 1 Gaussian components are plotted in red while state 2 components are plotted in blue; and the color opacity of each component is proportional to its coefficient in the Gaussian mixture. In figure 34(b), we notice that the updated means shifted overall towards the center of the $(u_x, u_y)$ plane in such a way that the dominant component has the lowest magnitude $|u_y|$.

Figure 34. Gaussian mixtures of (a) the initial $\lambda^{CPR}_{CHMM}$ model, and (b) the BW-trained $\lambda^{CPR}_{CHMM}$ model

Figure 35. Illustration of testing the CPR sequence of figure 32(a) with $\lambda^{CPR}_{CHMM}$

TABLE 12

Baseline CHMM model parameters of $\lambda^{\overline{CPR}}_{CHMM}$

Initial model

A        s1      s2      Priors
s1       0.5     0.5     0.50
s2       0.5     0.5     0.50

                     B means           Covariance        Coefficient
State   Component    u_x      u_y      u_x      u_y      w
s1      c11          -0.26     9.70    59.96    1.43     0.33
s1      c12          -1.04     0.55     0.72    0.62     0.33
s1      c13           0.02     0.06     0.10    0.10     0.33
s2      c21           0.42    -0.25     0.13    0.14     0.33
s2      c22          -0.49    -0.65     0.66    0.62     0.33
s2      c23          -0.01    -0.05     0.10    0.10     0.33

Trained model

A        s1      s2      Priors
s1       0.40    0.60    0.50
s2       0.25    0.75    0.50

                     B means           Covariance        Coefficient
State   Component    u_x      u_y      u_x      u_y      w
s1      c11          -3.95     7.92    25.97    9.56     0.01
s1      c12          -0.25     1.07     1.05    0.81     0.08
s1      c13          -0.01     0.05     0.17    0.10     0.92
s2      c21           0.00    -0.02     0.14    0.10     0.36
s2      c22          -0.44    -0.97     1.50    1.79     0.03
s2      c23           0.00     0.00     0.14    0.10     0.61

Figure 35 illustrates the testing process of the CPR sequence of figure 32(a) by $\lambda^{CPR}_{CHMM}$. Similarly to the DHMM model, the first and last observations (upward motion) are closer to the means of the Gaussian mixture of state 1 and thus have higher probabilities in state 1. Likewise, the middle observations (downward motion) have higher probabilities in the Gaussian mixture of state 2. The optimal state sequence, assigned by the Viterbi algorithm, to this test sample is 11122222222221111111.

Figure 36 compares the ROC curves generated by the trained $\lambda^{CPR}_{CHMM}$ and $\lambda^{\overline{CPR}}_{CHMM}$ models and by the baseline CHMM classifier that combines them (in solid lines) with those of the corresponding initial models before training (in dashed lines). We notice that the BW-training significantly improves the classification performance for the CPR model (green ROC curves of figure 36). However, the BW-trained non-CPR model underperforms compared to the non-trained non-CPR model in terms of the area under the ROC curve. Overall, the BW-training improves the performance of the baseline CHMM classifier that combines the CPR and non-CPR models.

Figure 36. ROCs generated by the BW-trained (solid lines) and initial (dashed lines) CHMM models.


4.3.4  Video  shot  classification  using  ensemble  HMM 

4.3.4.1 Fitting HMM models to individual sequences

For each sequence $O_r \in \mathcal{D}_{Trn}$, we build an HMM model, $\lambda_r$, based on the observations $\{u^r_1, \dots, u^r_T\}$. We assume that each $\lambda_r$ has two states, $s^{(r)}_1$ and $s^{(r)}_2$: one represents the upward motion of the hands and the second represents the downward motion. The states means are estimated using:

$$\begin{cases} s^{(r)}_1 = \mathrm{average}\{u^r \in O_r \mid u^r_y > 0\} \\ s^{(r)}_2 = \mathrm{average}\{u^r \in O_r \mid u^r_y < 0\} \end{cases} \qquad (104)$$

The remaining parameters $(A, B, \pi)$ of $\lambda_r$ are initialized as follows. The priors, $\pi_1$ and $\pi_2$, are set to [1; 0] for CPR sequences and [0.5; 0.5] for non-CPR sequences. The transition matrix is initialized using:

$$A = \begin{bmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \end{bmatrix}$$

The initialization of the emission probabilities for each state depends on whether we use a discrete or a continuous HMM model.


• For the discrete case, the emission matrix, B, is initialized based on the states and a codebook of size M = 20. In the case of individual non-CPR sequences models, the codewords $\{v_m, m = 1, \dots, M\}$ are the actual sequence observations $\{u^r_t, t = 1, \dots, T\}$ of $O_r$. For the CPR sequences models, we discard codewords that have low magnitude ($|u_y| < 0.5$). This will prevent observations with small motion vectors (mainly from non-CPR sequences) from having high probabilities in those codewords, i.e. in individual CPR sequences models. Finally, the emission probabilities are initialized based on the distance between the codewords and the states means, using (103).

• For the continuous case, the emission probabilities are modeled by one Gaussian component for each state. The choice of a single-component Gaussian mixture is motivated by the small number of observations and the uni-modality of the CPR sequences observations. This assumption is also valid for non-CPR sequences that can be modeled by a Gaussian distribution with a large variance. Therefore, for each state, the parameters of the Gaussian distribution are the mean of the state, computed using (104), and the statistical covariance obtained from the respective observations. In particular, we use:

$$\begin{cases} COV^{(r)}_1 = \mathrm{Cov}\{u^r = (u^r_x, u^r_y) \in O_r \mid u^r_y > 0\} \\ COV^{(r)}_2 = \mathrm{Cov}\{u^r = (u^r_x, u^r_y) \in O_r \mid u^r_y < 0\} \end{cases}$$

Figure 37. Motion vectors of two sample sequences: (a) CPR sequence $O_1$; (b) non-CPR sequence $O_2$
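
The initialization of a single-sequence continuous model can be sketched as follows. This is only an illustration under stated assumptions: the function name `init_sequence_model` and the four-observation toy sequence are hypothetical, the sample covariance returned by `numpy.cov` is assumed to be an acceptable reading of "statistical covariance", and each sign group is assumed to contain at least two observations.

```python
import numpy as np


def init_sequence_model(seq, is_cpr):
    """Initial parameters of a two-state, single-Gaussian HMM for one sequence.

    seq: array of shape (T, 2) holding the (u_x, u_y) motion vectors.
    State 1 collects observations with u_y > 0, state 2 those with u_y < 0.
    """
    seq = np.asarray(seq, dtype=float)
    up, down = seq[seq[:, 1] > 0], seq[seq[:, 1] < 0]
    means = np.vstack([up.mean(axis=0), down.mean(axis=0)])   # equation (104)
    covs = np.stack([np.cov(up, rowvar=False), np.cov(down, rowvar=False)])
    priors = np.array([1.0, 0.0]) if is_cpr else np.array([0.5, 0.5])
    A = np.full((2, 2), 0.5)                                  # uniform transitions
    return priors, A, means, covs


# Hypothetical CPR-like sequence with two upward and two downward vectors.
toy = [[0.1, 0.6], [0.0, 0.8], [-0.1, -0.5], [-0.2, -0.7]]
priors, A, means, covs = init_sequence_model(toy, is_cpr=True)
print(means)
```

Observations with $u_y = 0$ fall in neither group, which matches the strict inequalities of the equations above.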

After initialization, the transition probabilities and the emission probability parameters of each model are fine-tuned using either the discrete or continuous version of the Baum-Welch algorithm [9] and the corresponding training observation. As mentioned earlier in section 3.3.1.1, the use of only one observation sequence to form and train an HMM leads to over-fitting, and subsequently to the desired property that the likelihood of each sequence with respect to its own model is higher than the likelihoods of the other sequences in the same model. Let $\{\lambda_r\}_{r=1}^{R_n}$ be the set of trained individual models for the nth crossvalidation fold.

Let $O_1 = \{u^1_1, \dots, u^1_T\}$ and $O_2 = \{u^2_1, \dots, u^2_T\}$ be, respectively, a CPR sequence and a non-CPR sequence from the subset $\mathcal{D}_{Trn}$ of the training data. In figures 37(a) and (b), we show the sequence of 20 motion feature vectors of $O_1$ and $O_2$, respectively. Let $\lambda^{DHMM}_i$ and $\lambda^{CHMM}_i$ denote, respectively, the BW-trained DHMM and CHMM models for sequence $O_i$, $i \in \{1,2\}$.

Figure 38. Training observations, states means, and codewords of the models: (a) $\lambda^{DHMM}_1$, and (b) $\lambda^{DHMM}_2$ of the sequences $O_1$ and $O_2$ of figures 37(a)-(b)


Fitting discrete HMM models to individual sequences: Tables 13 and 14 show the DHMM model parameters for sequences $O_1$ and $O_2$, respectively. For each model, the initial and the Baum-Welch trained parameters (namely the state means, the state transition probabilities, the codewords, and their emission probabilities in each state) are shown. Since the motion feature vectors are 2-dimensional, we visualize the parameters of $\lambda^{DHMM}_1$ and $\lambda^{DHMM}_2$ in the $(u_x, u_y)$ plane in figures 38(a) and (b), respectively. From tables 13 and 14 and the respective figures 38(a) and 38(b), we notice that:


TABLE 13

$\lambda^{DHMM}_1$ model parameters

         Means             Initial A        A
State    u_x      u_y      s1      s2       s1      s2
s1        0.08     0.69    0.5     0.5      0.91    0.09
s2       -0.14    -0.56    0.5     0.5      0.13    0.87

         Codes             Initial B        B
Code     u_x      u_y      s1      s2       s1      s2
v1       -0.22     0.83    0.40    0.00     0.08    0.00
v2        0.10     1.34    0.00    0.00     0.08    0.00
v3        0.48     1.44    0.00    0.00     0.08    0.00
v4        0.70     1.22    0.00    0.00     0.08    0.00
v5        0.50     0.75    0.00    0.00     0.08    0.00
v6       -0.15    -0.61    0.00    0.47     0.00    0.62
v7       -0.41    -1.01    0.00    0.00     0.00    0.12
v8       -0.28    -0.87    0.00    0.12     0.00    0.12
v9       -0.17    -0.77    0.00    0.42     0.00    0.12
v10      -0.18     0.55    0.60    0.00     0.50    0.00
v11      -0.36     1.14    0.00    0.00     0.08    0.00

TABLE 14

$\lambda^{DHMM}_2$ model parameters

         Means             Initial A        A
State    u_x      u_y      s1      s2       s1      s2
s1        0.12     0.05    0.5     0.5      0.53    0.47
s2        0.01    -0.06    0.5     0.5      0.46    0.54

         Codes             Initial B        B
Code     u_x      u_y      s1      s2       s1      s2
v1       -0.04    -0.03    0.01    0.08     0.01    0.08
v2        0.04    -0.01    0.03    0.07     0.03    0.07
v3        0.01     0.00    0.02    0.08     0.02    0.07
v4        0.06     0.06    0.11    0.02     0.09    0.01
v5       -0.02    -0.10    0.01    0.08     0.01    0.09
v6       -0.06     0.03    0.04    0.06     0.04    0.06
v7        0.08     0.00    0.08    0.03     0.08    0.03
v8       -0.08    -0.14    0.01    0.07     0.02    0.08
v9       -0.09    -0.03    0.02    0.07     0.02    0.07
v10      -0.05    -0.01    0.02    0.08     0.02    0.08
v11      -0.01     0.01    0.03    0.07     0.03    0.07
v12       0.02     0.03    0.06    0.05     0.06    0.04
v13       0.01    -0.03    0.00    0.09     0.00    0.09
v14       0.05     0.04    0.09    0.02     0.08    0.02
v15      -0.03    -0.27    0.01    0.06     0.02    0.08
v16       0.09     0.00    0.10    0.02     0.09    0.02
v17       0.16    -0.03    0.10    0.02     0.09    0.02
v18       0.34     0.03    0.08    0.01     0.10    0.01
v19       0.36     0.04    0.07    0.00     0.10    0.01
v20       0.18     0.12    0.11    0.01     0.10    0.01

1. The means and the codewords of $\lambda^{DHMM}_1$ have higher magnitude motion vectors compared to the non-CPR model $\lambda^{DHMM}_2$.

2. The transition matrix A of $\lambda^{DHMM}_1$ has higher diagonal entries than that of $\lambda^{DHMM}_2$. Precisely, in $\lambda^{DHMM}_1$, the entry $a_{11} = 0.91$ is exactly the ratio of (1) the number of state 1 self-transitions (ten state 1 to state 1 transitions, as can be seen in figure 37(a)) to (2) the total number of transitions from state 1 (eleven transition occurrences). Similarly for the entry $a_{22}$, that ratio is 7 to 8, as can be deduced from the same figure 37(a). However, for $\lambda^{DHMM}_2$, the transitions from state 1 to either state 1 or state 2 (and vice versa) are almost equiprobable.

3. For the emission probabilities, we notice that, for the BW-trained $\lambda^{DHMM}_1$, one codeword ($v_{10}$ in state 1 and $v_6$ in state 2) has the highest emission probability while a few other codewords ($\{v_1, \dots, v_5, v_{11}\}$ in state 1 and $\{v_7, \dots, v_9\}$ in state 2) have the same relatively low probability. In both states, the codeword with the highest probability is the one closest to the state mean (refer to table 13 and to the subsequent illustration of the $\lambda^{DHMM}_1$ parameters in the $(u_x, u_y)$ plane in figure 38(a)). For the BW-trained non-CPR model $\lambda^{DHMM}_2$, all the codewords have relatively low magnitude motion vectors and their respective emission probabilities in each state are low and comparable (at most 0.10 as can be seen in table 14). In conclusion, the emission probability density function of each state of $\lambda^{DHMM}_1$ can be approximated by a discrete normal distribution centered around the mean of that state. Similarly, the emission probability density function of each state of $\lambda^{DHMM}_2$ can be approximated by a discrete uniform distribution. These findings will be discussed in the following section for the continuous case.
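
The transition ratios quoted in item 2 can be checked by counting transitions along a state path. The sketch below uses the path $p_{11}$ listed in table 18 (sequence $O_1$ decoded by its own discrete model); the helper name `transition_ratios` is hypothetical, and treating this Viterbi path as the state assignment behind the trained A is an assumption of the example.

```python
from collections import Counter


def transition_ratios(path, n_states=2):
    """Row-normalized transition counts along a state path (states are 1-based)."""
    counts = Counter(zip(path[:-1], path[1:]))
    matrix = []
    for i in range(1, n_states + 1):
        total = sum(counts[(i, j)] for j in range(1, n_states + 1))
        matrix.append([counts[(i, j)] / total for j in range(1, n_states + 1)])
    return matrix


p11 = [1] * 8 + [2] * 8 + [1] * 4   # path of O_1 in its own model (table 18)
ratios = transition_ratios(p11)
print(ratios)
```

The counts give 10/11 for $a_{11}$ and 7/8 for $a_{22}$, which round to the 0.91 and 0.87 reported in table 13.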


Fitting continuous HMM models to individual sequences: Similarly to the discrete case, we use the same sequences $O_1$ and $O_2$ (shown in figure 37) to fit two CHMM models: $\lambda^{CHMM}_1$ and $\lambda^{CHMM}_2$. The initial and BW-updated parameters of $\lambda^{CHMM}_1$ and $\lambda^{CHMM}_2$ are shown in tables 15 and 16, respectively.


TABLE 15

$\lambda^{CHMM}_1$ model parameters

Initial model

A        s1      s2
s1       0.5     0.5
s2       0.5     0.5

                     B means           Covariance          Coefficient
State   Component    u_x      u_y      u_x       u_y       w
s1      c11           0.08     0.69    0.108     0.256     1.0
s2      c21          -0.14    -0.56    0.021     0.094     1.0

Trained model

A        s1      s2
s1       0.90    0.10
s2       0.12    0.88

                     B means           Covariance          Coefficient
State   Component    u_x      u_y      u_x       u_y       w
s1      c11           0.08     0.73    0.105     0.228     1.0
s2      c21          -0.12    -0.50    0.019     0.112     1.0

We notice that the BW-training does not change the CHMM model parameters significantly except for the transition matrices. For the covariance values, it is worth noting that the variance along the $u_y$ direction is higher than the variance along the $u_x$ direction.


4.3.4.2 Similarity matrix computation

TABLE 16

$\lambda^{CHMM}_2$ model parameters

Initial model

A        s1      s2
s1       0.5     0.5
s2       0.5     0.5

                     B means           Covariance          Coefficient
State   Component    u_x      u_y      u_x       u_y       w
s1      c11           0.12     0.05    0.025     0.010     1.0
s2      c21           0.01    -0.06    0.010     0.010     1.0

Trained model

A        s1      s2
s1       0.28    0.72
s2       0.30    0.70

                     B means           Covariance          Coefficient
State   Component    u_x      u_y      u_x       u_y       w
s1      c11           0.08     0.02    0.016     0.010     1.0
s2      c21           0.04    -0.03    0.014     0.010     1.0

In the first step of the eHMM, we built a set $\{\lambda_r\}_{r=1}^{R_n}$ of trained individual HMM models for each crossvalidation fold. In this step, we test each observation sequence $O_j$ in each model $\lambda_i$, for $1 \le i, j \le R_n$. This test produces (1) a log-likelihood value $L_{ij} = \log \Pr(O_j \mid \lambda_i)$, that is, the log-probability of sequence $O_j$ being generated by $\lambda_i$, and (2) a path $p_{ij}$ that represents the optimal (most likely) sequence of states of $\lambda_i$ that would have generated $O_j$. This path is computed using the Viterbi algorithm [8]. We define the path mismatch penalty, denoted $P_{ij}$, as the edit distance [51] between the optimal paths $p_{ij}$ and $p_{ji}$ that resulted from testing $O_j$ in $\lambda_i$ and $O_i$ in $\lambda_j$, respectively. Given the log-likelihood and the path mismatch penalty matrices, we define the similarity matrix, S, as a weighted sum of L and P using a mixing factor $\alpha \in [0, 1]$ (refer to equation (74) in section 3.3.1.3).
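
As a point of reference for the path mismatch penalty, the following is a minimal sketch of an edit distance between two state paths. It assumes the standard Levenshtein distance with unit insertion, deletion, and substitution costs; the exact variant defined in [51] and used in the experiments may differ, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(b) + 1))        # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute x by y
        prev = curr
    return prev[-1]


# Two short hypothetical state paths that differ in one position.
print(edit_distance([1, 1, 2, 2], [1, 2, 2, 2]))
```

Two identical paths yield a penalty of zero, and the penalty grows with the number of edits needed to turn one path into the other.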


Similarity matrix computation using discrete HMMs: To illustrate the similarity matrix computation using discrete HMMs, we reconsider the sample sequences $O_1$ and $O_2$, shown in figure 37, and their respective DHMM models $\lambda^{DHMM}_1$ and $\lambda^{DHMM}_2$, displayed in tables 13 and 14. In figure 39, we illustrate the testing of the sequences $O_1$ and $O_2$ in $\lambda^{DHMM}_1$ and $\lambda^{DHMM}_2$.

The log-likelihoods and the consequent optimal Viterbi paths obtained by testing $O_1$ and $O_2$ in $\lambda^{DHMM}_1$ and $\lambda^{DHMM}_2$ are summarized in tables 17 and 18.

TABLE 17

Log-likelihoods of the two sequences in figure 37 in the models generated by these sequences

L_ij     $\lambda^{DHMM}_1$    $\lambda^{DHMM}_2$
O_1      -9.27                 -21.50
O_2      -38                   -3.21

Figure 39. Illustration of the testing of sequences $O_1$ and $O_2$ in $\lambda^{DHMM}_1$ and $\lambda^{DHMM}_2$

TABLE 18

Viterbi paths resulting from testing $O_1$ and $O_2$ in models $\lambda^{DHMM}_1$ and $\lambda^{DHMM}_2$. $p_{ij}$ refers to the optimal path obtained by testing sequence i with model j.

Path     u_1 ... u_20
p_11     1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 1 1 1 1
p_21     1 1 1 1 2 1 1 2 2 1 1 1 2 1 2 1 2 1 1 1
p_12     2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2
p_22     2 2 2 1 2 2 1 2 2 2 2 2 2 1 2 1 1 1 1 1

TABLE 19

Path mismatch penalty of the two sequences in figure 37 in the models generated by these sequences

P_ij     $\lambda^{DHMM}_1$    $\lambda^{DHMM}_2$
O_1      0                     9
O_2      8                     0

Figure 40. Log-likelihood, path mismatch penalty, and similarity matrices for the sequences in $\mathcal{D}_{Trn}$

From the paths in table 18, we use the "edit" distance measure to deduce the path mismatch penalty terms (shown in table 19) for sequence $O_1$ with regard to $O_2$ and vice versa. Figures 40(a) and 40(b) show the log-likelihood and path mismatch penalty matrices for all the training sequences $\mathcal{D}_{Trn}$. Figures 40(c) and 40(d) show the resulting similarity matrix using $\alpha = 0.5$ and $\alpha = 0.1$, respectively. In these matrices, the indices are rearranged so that the first entries correspond to the CPR sequences $\mathcal{D}^{CPR}_{Trn}$ and the latter ones correspond to the non-CPR sequences $\mathcal{D}^{\overline{CPR}}_{Trn}$. In these figures, dark pixels correspond to small values of the log-likelihood and path mismatch penalty and bright pixels correspond to larger entries of the corresponding matrices. Note that in the case of the log-likelihood matrix of figure 40(a), the diagonal blocks are brighter than the off-diagonal blocks. Conversely, in the path mismatch penalty matrix of figure 40(b), the diagonal blocks are darker than the off-diagonal blocks. Thus, the log-likelihood and the path mismatch penalty provide complementary measures for the similarity between signatures. This was the motivation for combining both matrices into the similarity S using the mixing factor $\alpha = 0.5$ in figure 40(c) and $\alpha = 0.1$ in figure 40(d). In the next step of the eHMM, we use the similarity matrix of figure 40(c) to partition $\mathcal{D}_{Trn}$ into groups of similar sequences.
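
As a sanity check on the two sample sequences, the off-diagonal entries of table 19 can be reproduced from the paths of table 18 with a simple position-wise comparison. The sketch below reads $p_{ij}$ as the path of sequence i in model j (the convention of the table 18 caption) and compares the path of a sequence in the other model with that model's own path. This is only an observation on these two sequences: the position-wise mismatch count is used here as a simple stand-in and is not a restatement of the penalty defined in section 3.3.1.3; the helper name `mismatches` is hypothetical.

```python
def mismatches(p, q):
    """Number of positions at which two equal-length state paths differ."""
    return sum(a != b for a, b in zip(p, q))


# Viterbi paths copied from table 18.
p11 = [1] * 8 + [2] * 8 + [1] * 4
p21 = [1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1]
p12 = [2] * 3 + [1] * 5 + [2] * 12
p22 = [2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1]

penalty_o1_in_model2 = mismatches(p12, p22)  # O_1 tested in the model of O_2
penalty_o2_in_model1 = mismatches(p21, p11)  # O_2 tested in the model of O_1
print(penalty_o1_in_model2, penalty_o2_in_model1)
```

The two counts are 9 and 8, the same values as the off-diagonal entries of table 19.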


4.3.4.3 Clustering results

The similarity matrix S of figure 40(c) is transformed to a symmetric distance matrix D using (77). Any relational-based clustering algorithm can be applied, using D, to partition the training data subset $\mathcal{D}_{Trn}$ into K clusters. In our experiments in this chapter, we use the relational clustering with learnable cluster dependent kernels (FLeCK) clustering algorithm [52] with K = 4.

Figure 41 shows the fuzzy membership of the training sequences in each cluster. Initially, a random fuzzy partition matrix $U^{(0)}$ of size $(R_n \times K)$ was created. Those initial membership values are shown in figure 41(a). The FLeCK algorithm converges when the membership values do not change significantly from one iteration to the next ($\|U^{(t)} - U^{(t-1)}\| < 10^{-6}$). In our experiment, FLeCK converged after 118 iterations. The resulting fuzzy memberships are shown in figure 41(b). The variance of each cluster is displayed in table 20. As shown in figure 41(b), the first sequences of $\mathcal{D}_{Trn}$ have high fuzzy membership in clusters 1 and 3, while the latter ones have higher memberships in clusters 2 and 4. Recall that the indices of $\mathcal{D}_{Trn}$ are rearranged so that the first elements correspond to the CPR sequences and the latter ones correspond to the non-CPR sequences, i.e. $\mathcal{D}_{Trn} = \{\mathcal{D}^{CPR}_{Trn}, \mathcal{D}^{\overline{CPR}}_{Trn}\}$. Thus, clusters 1 and 3 are composed mainly of CPR sequences and a few non-CPR sequences, and clusters 2 and 4 comprise only non-CPR sequences. Figure 42 shows the partition of the training data in terms of the number of CPR and non-CPR sequences in each cluster. In figure 43, we visualize the actual sequences belonging to each cluster (CPR sequences are displayed in red while non-CPR sequences are shown in blue). In particular, for each sequence, we plot only the $u_y$ component of each observation. We notice that all but one of the CPR sequences belonging to cluster 1 start with low magnitude $u_y$ up to the 5th observation. Most of these sequences have a high positive $u_y$ value at observations $\{6, \dots, 10\}$, negative $u_y$ values around observations $\{12, \dots, 16\}$, and then back to low magnitude $u_y$ values at the last observations. Cluster 3 comprises 22 CPR sequences and 1 non-CPR sequence. For all these sequences, including the non-CPR one, the first and last observations have high $u_y$ values (upward motion) while the middle observations have negative $u_y$ values (downward motion). Finally, clusters 2 and 4 are composed mainly of low-magnitude $u_y$ value sequences with a random number of transitions between negative and positive $u_y$ values.

Figure 41. Fuzzy membership of the training sequences in each cluster: (a) initial membership (b) after FLeCK convergence

TABLE 20

Variance in each cluster after FLeCK convergence

Cluster     $\sigma_1$    $\sigma_2$    $\sigma_3$    $\sigma_4$
Variance    0.13          0.11          0.13          0.11

Figure 42. Distribution of the sequences in each cluster per class (CPR vs. non-CPR)

The  above  figures  illustrate  the  purpose  of  step  2  of  the  proposed  eHMM  approach,  i.e. 
grouping  similar  signatures  into  clusters.  To  summarize,  using  the  distance  matrix  D  of  (77),  the 
R  sequences  are  clustered  into  K  clusters.  The  objective  of  this  clustering  step  is  to  group  similar 
observations  in  the  log-likelihood  space.  The  observations  forming  each  cluster  are  then  used  to 
learn  multiple  HMM  models. 
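
As a rough illustration of this grouping step, the sketch below partitions R items into K groups given only a matrix of pairwise distances. It uses a plain k-medoids loop, which is an assumption made for illustration and not necessarily the clustering algorithm used in this work; the function name `k_medoids` and the toy distance matrix are hypothetical.

```python
def k_medoids(dist, k, n_iter=100):
    """Partition items into k clusters using only a pairwise distance matrix."""
    n = len(dist)
    # Deterministic seeding: start from the most central item, then repeatedly
    # add the item that is farthest from the medoids chosen so far.
    medoids = [min(range(n), key=lambda i: sum(dist[i]))]
    while len(medoids) < k:
        medoids.append(max(range(n),
                           key=lambda i: min(dist[i][m] for m in medoids)))
    labels = [0] * n
    for _ in range(n_iter):
        # Assignment step: each item joins its closest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]])
                  for i in range(n)]
        # Update step: the new medoid minimizes the total distance to the
        # other members of its cluster (an empty cluster keeps its medoid).
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if members:
                new_medoids.append(
                    min(members, key=lambda i: sum(dist[i][j] for j in members)))
            else:
                new_medoids.append(medoids[c])
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels, medoids


# Toy example: items 0 and 1 are close to each other, as are items 2 and 3.
toy_dist = [
    [0, 1, 10, 10],
    [1, 0, 10, 10],
    [10, 10, 0, 1],
    [10, 10, 1, 0],
]
labels, medoids = k_medoids(toy_dist, 2)
```

Working from the distance matrix alone is what makes this kind of clustering applicable to sequences of different lengths, since no vector representation of the sequences is required.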

4.3.4.4 Cluster models initialization and training

For  each  cluster  in  figure  42,  one  or  two  HMM  models  are  built  based  on  the  elements 
belonging  to  that  cluster:  one  for  CPR  sequences  and/or  one  for  non-CPR  sequences.  In  particular, 
the  initialization  of  the  priors  and  transition  matrices  for  each  model  is  performed  by  averaging  the 
corresponding  parameters  of  the  individual  models  of  the  sequences  belonging  to  that  cluster.  We 
assume  that  each  model  has  N  =  2  states.  The  initialization  of  the  emission  probabilities  for  the 
discrete case is straightforward. That is, the state means, $s_n$, of the cluster model are obtained by
averaging the state means of the individual models. Finally, the codes $v_m$, $m = 1, \ldots, M$, of the
cluster model are obtained by clustering the codes of the individual models into $M = 20$ clusters.
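
The averaging-based initialization can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary keys `pi`, `A`, and `means` and the function names are hypothetical, state n is assumed to play the same role in every individual model, and the rows are renormalized after averaging only to guard against rounding.

```python
def average_matrices(matrices):
    """Element-wise average of equally shaped matrices (lists of rows)."""
    count = len(matrices)
    return [[sum(m[i][j] for m in matrices) / count
             for j in range(len(matrices[0][0]))]
            for i in range(len(matrices[0]))]


def normalize_rows(matrix):
    """Rescale each row to sum to one so that it stays a distribution."""
    return [[value / sum(row) for value in row] for row in matrix]


def init_cluster_model(individual_models):
    """Initial cluster HMM: average the parameters of the member models.

    Each individual model is a dict with keys "pi" (priors), "A" (transition
    matrix) and "means" (one mean vector per state); the key names are
    hypothetical and chosen for this sketch only.
    """
    priors = average_matrices([[m["pi"]] for m in individual_models])
    return {
        "pi": normalize_rows(priors)[0],
        "A": normalize_rows(
            average_matrices([m["A"] for m in individual_models])),
        "means": average_matrices([m["means"] for m in individual_models]),
    }


# Two individual models with N = 2 states and 2-D state means.
models = [
    {"pi": [1.0, 0.0], "A": [[0.9, 0.1], [0.2, 0.8]],
     "means": [[0.0, 2.0], [4.0, 0.0]]},
    {"pi": [0.5, 0.5], "A": [[0.7, 0.3], [0.4, 0.6]],
     "means": [[2.0, 0.0], [0.0, 4.0]]},
]
cluster_model = init_cluster_model(models)
```

The averaged parameters only serve as a starting point; the cluster model is subsequently refined by training on the sequences of the cluster.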

For the continuous case, we further assume that each state has three Gaussian components.
For each cluster model, the means and covariances of the state $n$ components are obtained by clustering
the state $n$ observations of the sequences belonging to the cluster into three clusters using the k-means
algorithm [23]. Precisely, the mean of each component is the center of one of the resulting clusters
and the covariance is estimated using the observations belonging to that same cluster.

Figure 43. Sequences belonging to each cluster: (a) Cluster 1, (b) Cluster 2, (c) Cluster 3, (d) Cluster 4. Only the $u_y$ motion vector component is plotted for each observation of each sequence (horizontal axis: observation).

After initialization, and depending on the cluster size and homogeneity, one of the training
schemes described earlier is selected. Let $\lambda_k^{CPR,DHMM}$ and $\lambda_k^{\overline{CPR},DHMM}$ denote the trained
DHMM models for the CPR and non-CPR sequences belonging to cluster $k$.


Cluster DHMM models initialization and training: In this case, we simply use the notation
$\lambda_k^{CPR}$ and $\lambda_k^{\overline{CPR}}$ to refer to the CPR and non-CPR cluster $k$ models. Figure 44(a) shows the
CPR training sequences of cluster 1 and the parameters of the model, $\lambda_1^{CPR}$, built using those
sequences. Similarly, in figure 44(b), we show the non-CPR sequences of cluster 1
and the parameters of their model, $\lambda_1^{\overline{CPR}}$. We notice that the codes of $\lambda_1^{CPR}$ have relatively high
magnitude motion vectors. The codes closest to the mean of each state have higher $B$ values in
that state. However, for the $\lambda_1^{\overline{CPR}}$ model, the majority of the codes have relatively low magnitude
motion vectors. A few codes of $\lambda_1^{\overline{CPR}}$ have very high magnitude motion vectors but their emission
probability is very low. This suggests that those codes represent observations from outlier non-CPR
sequence(s) in the training data.

Figure 44. Training sequences and DHMM model parameters of cluster 1: (a) CPR model $\lambda_1^{CPR}$, (b) non-CPR model $\lambda_1^{\overline{CPR}}$

Cluster 2 is composed uniquely of non-CPR sequences. Thus, we set its corresponding $\lambda_2^{CPR}$
to 0; by definition, this empty model returns a default likelihood value minProb = -40. In figure 45, we show the
non-CPR sequences belonging to cluster 2 and the $\lambda_2^{\overline{CPR}}$ model parameters. As expected, $\lambda_2^{\overline{CPR}}$
has relatively low-magnitude codes and state means, and the codes closest to the mean of each state
have higher $B$ values in that state.

Figure 45. Training sequences and $\lambda_2^{\overline{CPR}}$ DHMM model parameters of cluster 2

Cluster 3 is composed mainly of CPR sequences, with only 1 non-CPR sequence. In this
case, the non-CPR cluster 3 model, $\lambda_3^{\overline{CPR}}$, is defined as the individual model of that sequence.
The CPR model, $\lambda_3^{CPR}$, and the CPR sequences of cluster 3 are shown in figure 46.

Figure 46. Training sequences and $\lambda_3^{CPR}$ DHMM model parameters of cluster 3


4.3.4.5 Cluster models performance


In  Figure  47,  we  plot  the  confidence  of  the  training  sequences  in  the  four  cluster  models. 
Recall that the confidence of a sequence in each cluster is defined¹ as the difference between its
likelihood in the CPR cluster model and its likelihood in the non-CPR cluster model. As expected,
the highest confidences occur when testing the sequences of clusters 1 and 3 in their respective
cluster models. Similarly, sequences belonging to clusters 2 and 4 have higher confidence values in
cluster models 2 and 4. However, a few sequences of cluster 1 have higher confidences in models 2 and
4 than in models 1 and 3. Those sequences are actually non-CPR sequences belonging to cluster 1
that were assigned high confidences by the non-CPR models.
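
The confidence of a sequence in one cluster can be sketched as below for the discrete case. This is a hedged illustration, not the implementation used in this work: it assumes that the likelihoods are log-likelihoods computed with the scaled forward algorithm, that a missing class model is represented by `None`, and that an impossible sequence also falls back to the default value. The names `log_likelihood`, `confidence`, and the dictionary keys are hypothetical.

```python
import math

MIN_PROB = -40.0  # default value returned when a cluster has no model for a class


def log_likelihood(model, obs):
    """Scaled forward algorithm for a discrete HMM: log P(obs | model)."""
    if model is None:
        return MIN_PROB
    pi, A, B = model["pi"], model["A"], model["B"]
    n_states = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][obs[t]]
                     for j in range(n_states)]
        scale = sum(alpha)
        if scale == 0.0:
            # Assumption: reuse the default value for impossible sequences.
            return MIN_PROB
        alpha = [a / scale for a in alpha]
        log_lik += math.log(scale)
    return log_lik


def confidence(cpr_model, non_cpr_model, obs):
    """Difference between the CPR and non-CPR log-likelihoods of a sequence."""
    return log_likelihood(cpr_model, obs) - log_likelihood(non_cpr_model, obs)


# A toy 2-state, 2-symbol CPR model; the cluster has no non-CPR model.
toy_cpr = {
    "pi": [0.5, 0.5],
    "A": [[0.5, 0.5], [0.5, 0.5]],
    "B": [[0.9, 0.1], [0.9, 0.1]],
}
confidence_value = confidence(toy_cpr, None, [0, 0])
```

With this convention, a cluster that contains sequences of a single class still produces a finite confidence, because the missing model contributes the default value.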

Since  the  output  of  each  cluster  model  is  a  confidence  value,  we  can  treat  them  as  individual 
classifiers  and  evaluate  their  performance  on  the  training  data.  Figure  48  shows  the  ROCs  of  each 
cluster model on the training data $\mathcal{D}_{Trn}$.

4.3.4.6 eHMM fusion results

In  the  previous  steps  of  the  eHMM,  we  initialized  and  learned  multiple  models  for  the 
different clusters. Each training sequence ($O_r \in \mathcal{D}_{Trn}$) is tested in the $K$ models and is assigned a
confidence value by each of them. Let $f_r$ denote the $K$-dimensional vector of confidences assigned by the models $\{\lambda_k\}$ to $O_r$. The
multiple  responses  of  the  different  models  need  to  be  combined  into  a  single  confidence  value  using 
one  of  the  following  combination  methods. 

• Artificial Neural Network: A single-layer perceptron with no hidden layers taking $\{f_1, \ldots, f_{R_n}\}$
as input. Using the standard backpropagation algorithm [64], the ANN is trained to output
the corresponding labels $\{y_1, \ldots, y_{R_n}\}$ of the training data. In other words, the ANN weights
are optimized to minimize the misclassification error on the training data. In table 21, we
show the weights of each node in the ANN's layer.

• Mixture of Experts: A one-level HME with a branching factor of two is used. The input to
the HME is a $(K+1)$-dimensional vector of the cluster models' confidences plus an intercept.
The HME parameters are updated using the expectation-maximization [26] algorithm and
the labeled training data. These parameters are the gating network weights, $v_0$, at the
non-terminal node and the experts' weights, $v_{11}$ and $v_{12}$, at the leaf nodes. The EM-learned
parameters are reported in table 22.

• Algebraic methods: We simply use algebraic operations such as the average or the maximum
to combine the confidences output by the $K$ models. This method has no parameters and
is not trainable. In this case, the extra step of computing the confidences of the training data
is not needed.

¹ Refer to equation (101) in section 4.2.4.1 for the formal definition.

Figure 47. Confidence of the training sequences of each cluster (clusters 1 to 4) in all cluster models (models 1 to 4): (a) using initial models, (b) using BW-trained models
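
The three combination rules can be sketched as follows. The weights are those reported in tables 21 and 22, but the functional forms are assumptions made for illustration: a logistic output unit for the single-layer ANN, and logistic gating and expert units for the one-level HME. The function names are hypothetical.

```python
import math

ANN_WEIGHTS = [0.28, 0.75, 0.91, -0.26]   # table 21, models 1 to 4
ANN_BIAS = -0.3                            # table 21

# Table 22, models 1 to 4 followed by the intercept.
HME_V0 = [4.16, 5.20, 5.62, -2.26, 190.00]
HME_V11 = [-0.33, 0.07, -0.04, 0.03, -0.96]
HME_V12 = [7.6, 41.23, -5.83, -13.98, 368.64]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def ann_fusion(f, weights=ANN_WEIGHTS, bias=ANN_BIAS):
    """Single-layer perceptron: logistic unit over the K confidences (assumed form)."""
    return sigmoid(dot(weights, f) + bias)


def hme_fusion(f, v0=HME_V0, v11=HME_V11, v12=HME_V12):
    """One-level HME with two experts; the gate blends the expert outputs (assumed form)."""
    x = list(f) + [1.0]              # K confidences plus an intercept term
    gate = sigmoid(dot(v0, x))
    return gate * sigmoid(dot(v11, x)) + (1.0 - gate) * sigmoid(dot(v12, x))


def average_fusion(f):
    """Parameter-free fusion: mean of the K confidences."""
    return sum(f) / len(f)


def max_fusion(f):
    """Parameter-free fusion: largest of the K confidences."""
    return max(f)


zero_input_ann = ann_fusion([0.0, 0.0, 0.0, 0.0])
neutral_hme = hme_fusion([1.0, 2.0], v0=[0.0] * 3, v11=[0.0] * 3, v12=[0.0] * 3)
```

The trainable rules can weight a reliable cluster model more heavily than the others, whereas the algebraic rules treat all K confidences alike.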


Figure 49 shows the ROCs of the ANN, HME, and SUM fusion using the training data.
To evaluate the fusion results, we juxtapose the ROC of the best cluster model classifier (cluster
model 1 from figure 48). We notice that the SUM fusion performs poorly since it simply averages the
confidence assigned by the cluster 1 model with the confidences assigned by the other, weaker cluster
model classifiers (2, 3, and 4). The ANN fusion ROC is slightly better than that of the best cluster model
classifier. The HME significantly outperforms the other fusion methods and the cluster 1 model
classifier. However, this could be due to overfitting. The shape of the HME ROC suggests that the
HME output is overfit to the data and tends to be discrete (0 for most of the training non-CPR
sequences and 1 for most of the training CPR sequences).

Figure 48. ROCs generated by the cluster models using the training data

TABLE 21

Weights and bias of the single-layer ANN after backpropagation training

Model    Weight
1        0.28
2        0.75
3        0.91
4        -0.26
Bias     -0.3

4.3.4.7 eHMM performance

Recall that we use a 4-fold cross-validation setting. For the $n$th fold, a subset $\mathcal{D}_{Trn}^{n}$, of size
$R_n$, is used for training and another subset, $\mathcal{D}_{Tst}^{n}$, of size $Q_n = N - R_n$, is used for testing. All
the eHMM steps detailed in the previous sections are repeated using the subset $\mathcal{D}_{Trn}^{n}$ for each
cross-validation fold. The training step results in the mixture models $\{\lambda_k^n\}_{k=1}^{K}$, and an ANN- or HME-based
fusion model, $H_n$, for each cross-validation fold $n$, $n = 1, \ldots, 4$. A new sequence $O_q \in \mathcal{D}_{Tst}^{n}$
is tested in the models $\{\lambda_k^n\}$. Depending on the type and the contents of the clusters, one
or more models may have a strong response to the test sequence $O_q$. The multiple responses of the
different models are combined into a single confidence value using $H_n$. The label of $O_q$, CPR
or non-CPR, is used only as ground truth to evaluate the performance of the eHMM. In figure
50, we report the ROCs of the eDHMM using ANN and HME fusion, and the ROC of the baseline
DHMM provided earlier in figure 33 in section 4.3.3. The ROCs, generated using the same
4-fold cross-validation partition, show that the eDHMM outperforms the baseline DHMM and that
the ANN fusion method is better suited for the eDHMM. This may be due to the HME optimization
being overfit to the training data (refer to figure 49).

TABLE 22

HME gating network and experts network parameters after EM training

Model        $v_0$     $v_{11}$   $v_{12}$
1            4.16      -0.33      7.6
2            5.20      0.07       41.23
3            5.62      -0.04      -5.83
4            -2.26     0.03       -13.98
Intercept    190.00    -0.96      368.64

Figure 49. ROCs of the fusion methods and the best cluster model (from figure 48) using the training
data

In figure 51, we compare the ROCs of the eCHMM using ANN and HME fusion to the
ROC of the baseline CHMM provided earlier in figure 36 in section 4.3.3. These ROCs show that
the eCHMM outperforms the baseline CHMM. However, the improvement of the eCHMM over the
baseline CHMM is not as significant as in the discrete case. This can be seen in table 23, where we
report the Area Under the Curve (AUC) of all the ROCs of figures 50 and 51. Overall, we can conclude that
the continuous eHMM is better suited for the data at hand, i.e., identifying CPR sequences in video
simulating medical crises using motion-based features.
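
For reference, the area under an ROC curve can be computed from a list of confidences and ground-truth labels as sketched below. This is a generic trapezoidal-rule illustration that yields a value between 0 and 1; it makes no assumption about the scale or normalization of the values reported in table 23, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(confidences, labels):
    """Area under the ROC curve by the trapezoidal rule.

    labels: 1 for the positive class (CPR), 0 for the negative class
    (non-CPR). Both classes must be present.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sweep the decision threshold from the highest confidence downwards.
    order = sorted(range(len(labels)), key=lambda i: -confidences[i])
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    auc = 0.0
    i = 0
    while i < len(order):
        # Sequences with equal confidence cross the threshold together.
        j = i
        while j < len(order) and confidences[order[j]] == confidences[order[i]]:
            tp += labels[order[j]]
            fp += 1 - labels[order[j]]
            j += 1
        tpr, fpr = tp / n_pos, fp / n_neg
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2.0
        prev_tpr, prev_fpr = tpr, fpr
        i = j
    return auc


perfect = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
partial = roc_auc([0.9, 0.8, 0.7, 0.1], [1, 0, 1, 0])
```

In a cross-validation setting, the confidences of all test folds can be pooled before calling such a function, provided that every fold uses the same confidence scale.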
Figure 50. ROCs of the eDHMM and the baseline DHMM classifiers using 4-fold cross-validation

Figure 51. ROCs of the eCHMM and the baseline CHMM classifiers using 4-fold cross-validation


4.4  Chapter  summary 

In this chapter, the proposed eHMM classifier was applied and evaluated on the classification of
CPR sequences. First, we gave an overview of the preprocessing and extraction of CPR sequences
from the simulation videos. Then, we outlined the architecture of the eHMM classifier and its
intermediate steps. Finally, we compared the classification results of the eHMM to the baseline
HMM. The experiments show that the proposed eHMM intermediate results are in line with the
expected behavior and that, overall, the ensemble HMM outperforms the two-model baseline HMM.
The results also show that both the HME and ANN fusion methods outperform the individual cluster
models and simple combination methods.
TABLE 23

Area Under ROC Curve (AUC) of the eHMM and the baseline HMM classifiers

Classifier           AUC
eDHMM using ANN      538
eDHMM using HME      293
bDHMM                222
eCHMM using ANN      598
eCHMM using HME      611
bCHMM                574
CHAPTER 5


APPLICATION  TO  LANDMINE  DETECTION 

In  this  chapter,  the  proposed  ensemble  HMM  classifier  is  evaluated  using  real  datasets  for 
landmine  detection.  The  proposed  eHMM  learning  steps  are  individually  analyzed  and,  ultimately, 
the  eHMM  classifier’s  performance  is  compared  against  the  baseline  HMM  classifier  and  other  state- 
of-the-art  algorithms. 

5.1  Introduction 

Detection,  localization,  and  subsequent  neutralization  of  buried  antipersonnel  (AP)  and 
antitank  (AT)  landmines  is  a  worldwide  humanitarian  and  military  problem.  The  latest  statistics 
[77]  show  that  in  2012,  a  total  of  3,638  casualties  from  mines  were  recorded  in  62  countries  and 
areas,  including  1,066  people  killed  and  2,552  injured.  In  fact,  the  vast  majority  (78%)  of  recorded 
landmine  casualties  were  civilians.  Detection  and  removal  of  landmines  is  therefore  a  significant 
problem, and has attracted several researchers in recent years. One challenge in landmine detection
lies in plastic or low-metal mines that are difficult or impossible to detect with traditional metal detectors.
A variety of sensors have been proposed or are under investigation for landmine detection. The
research  problem  for  sensor  data  analysis  is  to  determine  how  well  signatures  of  landmines  can  be 
characterized  and  distinguished  from  other  objects  under  the  ground  using  returns  from  one  or  more 
sensors.  Ground  Penetrating  Radar  (GPR)  offers  the  promise  of  detecting  landmines  with  little  or 
no  metal  content.  Unfortunately,  landmine  detection  via  GPR  has  been  a  difficult  problem  [78,  79]. 
Although  systems  can  achieve  high  detection  rates,  they  have  done  so  at  the  expense  of  high  false 
alarm  rates.  The  key  challenge  to  mine  detection  technology  lies  in  achieving  a  high  rate  of  mine 
detection while maintaining a low level of false alarms. The performance of a mine detection system
is  therefore  commonly  measured  by  a  receiver  operating  characteristics  (ROC)  curve  that  jointly 
specifies  rate  of  mine  detection  and  level  of  false  alarm. 

Automated detection algorithms can generally be broken down into four phases: pre-processing,
feature extraction, confidence assignment, and decision-making. Pre-processing algorithms perform
tasks such as normalization of the data, corrections for variations in height and speed, removal of
stationary effects due to the system response, etc. Methods that have been used to perform this
task include wavelets and Kalman filters [80], subspace methods and matching to polynomials [81],
and subtracting optimally shifted and scaled reference vectors [82]. Feature extraction algorithms
reduce the pre-processed raw data to form a lower-dimensional, salient set of measures that represent
the data. Principal component (PC) transforms are a common tool to achieve this task [83, 84].
Other feature analysis approaches include wavelets [85], image processing methods of derivative feature
extraction [86], curve analysis using Hough and Radon transforms [87], as well as model-based
methods. Confidence assignment algorithms can use methods such as Bayesian [87], hidden Markov
models [86, 88, 33], fuzzy logic [89], rules and order statistics [90], neural networks, or nearest neighbor
classifiers [91, 92], to assign a confidence that a mine is present at a point. Decision-making
algorithms often post-process the data to remove spurious responses and use a set of confidence
values produced by the confidence assignment algorithm to make a final mine/no-mine decision.

In [86, 88], hidden Markov modeling was proposed for detecting both metal and nonmetal
mine types using data collected by a moving-vehicle-mounted GPR system, and HMM
techniques were shown to be feasible and effective for landmine detection. The initial work relied on simple gradient
edge  features.  Subsequent  work  used  an  edge  histogram  descriptor  (EHD)  approach  to  extract 
features  from  the  original  GPR  signatures.  The  baseline  HMM  classifier  consists  of  two  HMM  models, 
one  for  mine  and  one  for  background.  The  mine  (background)  model  captures  the  characteristics  of 
the  mine  (background)  signatures.  The  model  initialization  and  subsequent  training  are  based  on 
averaging  over  the  training  data  corresponding  to  each  class. 

5.2  Data  preprocessing  and  pre-screening 
5.2.1  GPR  data 

The input data consists of a sequence of raw GPR signatures collected by a NIITEK Inc.
landmine detection system comprising a vehicle-mounted GPR array [3] (see figure 52). The NIITEK
GPR collects 51 channels of data. Adjacent channels are spaced approximately five centimeters apart
in the cross-track direction, and sequences (or scans) are taken at approximately five-centimeter
down-track intervals. The system uses a V-dipole antenna that generates a wide-band pulse ranging
from 200 MHz to 7 GHz. Each A-scan, that is, the measured waveform that is collected in one
channel at one down-track position, contains 416 time samples at which the GPR signal return is
recorded. Each sample corresponds to roughly 8 picoseconds. We often refer to the time index
as depth although, since the radar wave is traveling through different media, this index does not
represent a uniform sampling of depth. Thus, we model an entire collection of input data as a
three-dimensional matrix of sample values, $S(z, x, y)$, $z = 1, \ldots, 416$; $x = 1, \ldots, 51$; $y = 1, \ldots, N_S$, where
$N_S$ is the total number of collected scans, and the indices $z$, $x$, and $y$ represent depth, cross-track
position, and down-track position, respectively. A collection of scans, forming a volume of data, is
illustrated in figure 53.

Figure 52. NIITEK vehicle-mounted GPR system [3]

Figure  54  displays  several  B-scans  (sequences  of  A-scans)  both  down-track  (formed  from  a 
time  sequence  of  A-scans  from  a  single  sensor  channel)  and  cross-track  (formed  from  each  channel’s 
response  in  a  single  sample).  The  surveyed  object  position  is  highlighted  in  each  figure.  The  objects 
scanned are (a) a high-metal content antitank mine, (b) a low-metal antipersonnel mine, and (c) a
wood  block. 

5.2.2  Data  preprocessing 

Preprocessing is an important step to enhance the mine signatures for detection. In general,
preprocessing includes ground-level alignment and signal and noise background removal. First, we
identify the location of the ground bounce as the signal's peak and align the multiple signals with
respect to their peaks. This alignment is necessary because the vehicle-mounted system cannot
maintain the radar antenna at a fixed distance above the ground. The early time samples of each
signal, up to a few samples beyond the ground bounce, are discarded. The remaining signal samples
are divided into $N$ depth bins, and each bin is processed independently. The reason for this
segmentation is to compensate for the high contrast between the responses from deeply buried and
shallow anomalies. Next, the adaptive least mean squares (LMS) pre-screener [93] is used to focus
attention and identify regions with subsurface anomalies. The goal of a pre-screener algorithm, in the
framework of vehicle-mounted real-time landmine detection, is to flag locations of interest utilizing
a computationally inexpensive algorithm so that more advanced feature-processing approaches can
be applied only on the small subsets of data flagged by the pre-screener. The LMS is applied to the
energy at each depth bin and assigns a confidence value to each point in the (cross-track, down-track)
plane based on its contrast with a neighboring region. The connected components that satisfy empirically
predetermined conditions are considered as potential targets. The cross-track, $x_s$, and down-track, $y_s$,
positions of each connected component center are reported as alarm positions for further processing
by the feature-based discrimination algorithm, which attempts to separate mine targets from naturally
occurring clutter.

Figure 53. A collection of a few GPR scans

Figure 54. NIITEK radar down-track and cross-track (at the position indicated by a line in the
down-track) B-scan pairs for (a) an Anti-Tank (AT) mine, (b) an Anti-Personnel (AP) mine, and (c) a
non-metal clutter alarm.
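
The ground-level alignment and depth segmentation steps of this section can be sketched as follows. This is a simplified illustration under stated assumptions: the ground bounce is taken as the largest sample of each A-scan, the number of samples discarded beyond the peak (`skip`) is a hypothetical parameter, and the function names are not taken from this work.

```python
def align_to_ground_bounce(a_scans, skip=2):
    """Align A-scans on their ground-bounce peak and drop the early samples.

    a_scans: list of equally long lists of samples (one list per A-scan).
    skip: number of samples discarded starting at the peak (hypothetical).
    """
    # Locate the ground bounce as the position of the largest sample.
    peaks = [max(range(len(scan)), key=lambda t: scan[t]) for scan in a_scans]
    starts = [peak + skip for peak in peaks]
    # Keep the same number of samples for every A-scan after alignment.
    length = min(len(scan) - start for scan, start in zip(a_scans, starts))
    return [scan[start:start + length] for scan, start in zip(a_scans, starts)]


def split_into_depth_bins(signal, n_bins):
    """Divide a signal into n_bins equal depth bins (leftover samples are dropped)."""
    size = len(signal) // n_bins
    return [signal[b * size:(b + 1) * size] for b in range(n_bins)]


# Two A-scans whose ground bounce (value 9) occurs at different time indices.
scans = [
    [0, 9, 1, 2, 3, 4],
    [0, 0, 9, 5, 6, 7],
]
aligned = align_to_ground_bounce(scans, skip=1)
bins = split_into_depth_bins([1, 2, 3, 4, 5, 6], 3)
```

After alignment, the same sample index refers to approximately the same distance below the ground surface in every A-scan, which is what allows each depth bin to be processed independently.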

5.3  Feature  extraction 

The  goal  of  the  feature  extraction  step  is  to  transform  original  GPR  data  into  a  sequence  of 
observation  vectors.  We  use  four  types  of  features  that  have  been  proposed  and  used  independently. 
Each  feature  represents  a  different  interpretation  of  the  raw  data  and  aims  at  providing  a  good 
discrimination  between  mine  and  clutter  signatures.  These  features  are  outlined  in  the  following 
subsections. 

5.3.1  EHD  features 

The Edge Histogram Descriptor (EHD) [94] captures the salient properties of the 3-D alarms
in  a  compact  and  translation-invariant  representation.  This  approach,  inspired  by  the  MPEG-7 
EHD  [95],  extracts  edge  histograms  capturing  the  frequency  of  occurrence  of  edge  orientations  in 
the  data  associated  with  a  ground  position.  The  basic  MPEG-7  EHD  has  undergone  rigorous  testing 
and  development,  and  thus,  represents  one  of  the  generic  and  efficient  texture  descriptors.  For  a 
generic  image,  the  EHD  represents  the  frequency  and  the  directionality  of  the  brightness  changes 
in  the  image.  Simple  edge  detector  operators  are  used  to  identify  edges  and  group  them  into  five 
categories:  vertical,  horizontal,  45°  diagonal,  135°  antidiagonal,  and  isotropic  (nonedges).  The  EHD 
would include five bins corresponding to the aforementioned categories. For our application, we
adapt the EHD to capture the spatial distribution of the edges within a 3-D GPR data volume.
To keep the computation simple, we still use 2-D edge operators. In particular, we fix the cross-track
dimension and extract edges in the (depth, down-track) plane. The overall edge histogram is
obtained by averaging the output of the individual (depth, down-track) planes. Also, since vertical,
horizontal, diagonal, and antidiagonal edges are the main orientations present in the mine signatures,
we keep the five edge categories of the MPEG-7 EHD.

Let $S_{zy}^{(x)}$ be the $x$th plane of the 3-D signature $S(x, y, z)$. First, for each $S_{zy}^{(x)}$, we compute four
categories of edge strengths: vertical, horizontal, 45° diagonal, and 135° antidiagonal. If the maximum
of the edge strengths exceeds a certain preset threshold $\theta_G$, the corresponding pixel is considered to
be an edge pixel. Otherwise, it is considered a nonedge pixel.

In the HMM models, we take the down-track dimension as the time variable (i.e., $y$ corresponds
to time in the HMM model). Our goal is to produce a confidence that a mine is present at
various positions, $(x, y)$, on the surface being traversed. To fit into the HMM context, a sequence of
observation vectors must be produced for each point. The observation sequence of $S_{zy}^{(x)}$ at a fixed
depth $z$ is the sequence of $T$ observation vectors $H_{zy_i}^{(x)}$, $i = 1, \ldots, T$, each representing a five-bin edge
histogram corresponding to the subimage $S_{zy_i}^{(x)}$.

The overall sequence of observation vectors computed from the 3-D signature $S(x, y, z)$ is
then:

$$\mathcal{H}(x, y, z) = [\bar{H}_{zy_1}, \bar{H}_{zy_2}, \cdots, \bar{H}_{zy_T}], \qquad (105)$$

where $\bar{H}_{zy_i}$ is the cross-track average of the edge histograms of subimage $S_{zy_i}^{(x)}$ over $N_C$ channels, i.e.,

$$\bar{H}_{zy_i} = \frac{1}{N_C} \sum_{x=1}^{N_C} H_{zy_i}^{(x)}. \qquad (106)$$

The  extraction  steps  of  the  EHD  features  are  illustrated  in  Figure  55. 
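
A five-bin edge histogram of one (depth, down-track) subimage and its cross-track average can be sketched as follows. The 2x2 masks below follow the style of the MPEG-7 edge operators and are an assumption of this sketch; the edge operators, the threshold value, and the function names are not taken from this work.

```python
import math

SQRT2 = math.sqrt(2.0)
# 2x2 masks given as (top-left, top-right, bottom-left, bottom-right).
MASKS = {
    "vertical": (1.0, -1.0, 1.0, -1.0),
    "horizontal": (1.0, 1.0, -1.0, -1.0),
    "diagonal": (SQRT2, 0.0, 0.0, -SQRT2),
    "antidiagonal": (0.0, SQRT2, -SQRT2, 0.0),
}
BINS = ["vertical", "horizontal", "diagonal", "antidiagonal", "nonedge"]


def edge_histogram(plane, threshold):
    """Normalized five-bin edge histogram of a 2-D subimage (list of rows)."""
    counts = dict.fromkeys(BINS, 0)
    for r in range(len(plane) - 1):
        for c in range(len(plane[0]) - 1):
            block = (plane[r][c], plane[r][c + 1],
                     plane[r + 1][c], plane[r + 1][c + 1])
            strengths = {name: abs(sum(m * v for m, v in zip(mask, block)))
                         for name, mask in MASKS.items()}
            best = max(strengths, key=strengths.get)
            # A pixel is an edge pixel only if its strongest response
            # exceeds the preset threshold; otherwise it is a nonedge pixel.
            counts[best if strengths[best] > threshold else "nonedge"] += 1
    total = sum(counts.values())
    return [counts[name] / total for name in BINS]


def cross_track_average(histograms):
    """Average the per-channel histograms of one observation, as in (106)."""
    n_channels = len(histograms)
    return [sum(h[b] for h in histograms) / n_channels
            for b in range(len(BINS))]


# A subimage with a single vertical edge between the second and third columns.
step = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
step_hist = edge_histogram(step, threshold=0.5)
avg_hist = cross_track_average([[1, 0, 0, 0, 0], [0, 0, 0, 0, 1]])
```

Repeating `edge_histogram` over the T subimages of each channel and averaging across channels with `cross_track_average` gives one sequence of T five-dimensional observation vectors per alarm.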

Figures 56 and 57 display the edge histogram features for a strong mine and a clutter
encounter identified by the pre-screener due to its high-energy contrast. As can be seen, the EHD
of the mine encounter can be characterized by a strong response of the diagonal and anti-diagonal
edges. Moreover, the frequency of the diagonal edges is higher than the frequency of the anti-diagonal
edges on the left side of the image (rising edge of the signature) and lower on the right part (falling
edge). This feature is typical in mine signatures. The EHD features of the clutter encounter, on
the other hand, do not follow this pattern. The edges do not follow a specific structure, and the
diagonal and anti-diagonal edges are usually weaker.

Figure 55. Illustration of the EHD feature extraction process.


5.3.2  Gabor  features 

Gabor  features  characterize  edges  in  the  frequency  domain  at  multiple  scales  and  orientations 
and  are  based  on  Gabor  wavelets  [33].  This  feature  is  extracted  by  expanding  the  signature’s  B-scan 
(depth  vs.  downtrack)  using  a  bank  of  scale  and  orientation  selective  Gabor  filters.  Expanding  a 
signal  using  Gabor  filters  provides  a  localized  frequency  description.  In  our  experiments,  without 
loss  of  generality,  we  use  a  bank  of  filters  tuned  to  the  combination  of  two  scales,  at  octave  intervals, 
and  four  orientations,  at  45-degree  intervals. 

Let $S_x(y, z)$ be the $x$th plane of the 3-D signature $S(x, y, z)$. Let $SG^{(k)}(x, y, z)$, $k = 1, \ldots, 8$,
denote the response of $S_x(y, z)$ to the eight Gabor filters. Figure 58 displays the response of three
distinct signatures to the eight Gabor filters. Figure 58(a) shows the response of a strong mine
signature. For this alarm, the signature has a strong response to the 45-degree filters (at both
scales) on the left part of the signature (rising edge), and a strong response to the 135-degree filters
on the right part of the signature (falling edge). Similarly, the middle of the signature has a strong
response to the horizontal filters (flat edge). Figure 58(b) displays the response of a weak mine
signature. For this signature, the edges are not as strong as those in Figure 58(a). As a result, it
has a weaker response at both scales, especially for the rising edge. Figure 58(c) displays a clutter
signature (identified by the pre-screener) and its response. As can be seen, this signature has a
strong response to the 45-degree filters. However, this response is not localized on the left side of
the signature, and is not followed by a falling edge as is the case for most mine signatures.
Figure 56. EHD features of a strong mine signature: (a) mine signature in the (depth, down-track)
plane, (b) edge orientation of each location of the signature, (c) EHD features for the 15 observations

Figure 57. EHD features of a clutter encounter: (a) clutter GPR signature in the (depth, down-track)
plane, (b) edge orientation of each location of the signature (legend: horizontal, vertical, diagonal,
anti-diagonal, non-edge), (c) EHD features for the 15 observations

Figure 58. Response of three distinct alarm signatures to Gabor filters at two scales and four
orientations.

The observation sequence of $S_x(y, z)$ at a fixed depth $z$ is the sequence of observation vectors

$$O(x, y-p, z),\; O(x, y-p+1, z),\; \ldots,\; O(x, y, z),\; O(x, y+1, z),\; \ldots,\; O(x, y+p, z), \qquad (107)$$

where

$$O(x, y, z) = [O_1(x, y, z), \ldots, O_8(x, y, z)] \qquad (108)$$

and

$$O_k(x, y, z) = \sum_{z \in w} SG^{(k)}(x, y, z). \qquad (109)$$

In (109), $O_k(x, y, z)$ encodes the response of $S(x, y, z)$ to the $k$th Gabor filter, and $w$ is a
depth window around the depth being considered.
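
A bank of eight Gabor filters (two scales at octave intervals, four orientations at 45-degree intervals) and the resulting eight-dimensional observation vector can be sketched as follows. The kernel form, the base wavelength, the envelope width, the kernel size, and the use of the absolute filter response are all assumptions of this sketch; only the number of scales and orientations and the summation over a depth window come from the text.

```python
import math


def gabor_kernel(wavelength, theta, half_size):
    """Real Gabor kernel indexed as kernel[dz + half_size][dy + half_size]."""
    sigma = 0.5 * wavelength  # assumed width of the Gaussian envelope
    kernel = []
    for dz in range(-half_size, half_size + 1):
        row = []
        for dy in range(-half_size, half_size + 1):
            # Rotate the coordinates so the cosine varies along direction theta.
            u = dy * math.cos(theta) + dz * math.sin(theta)
            v = -dy * math.sin(theta) + dz * math.cos(theta)
            envelope = math.exp(-(u * u + v * v) / (2.0 * sigma * sigma))
            row.append(envelope * math.cos(2.0 * math.pi * u / wavelength))
        kernel.append(row)
    return kernel


def filter_bank(base_wavelength=4.0, half_size=4):
    """Eight kernels: 2 scales (octave apart) x 4 orientations (45 degrees apart)."""
    return [gabor_kernel(base_wavelength * 2 ** scale, k * math.pi / 4.0, half_size)
            for scale in range(2) for k in range(4)]


def response_at(b_scan, kernel, z, y):
    """Absolute filter response at (z, y); samples outside the B-scan count as zero."""
    half = len(kernel) // 2
    total = 0.0
    for dz in range(-half, half + 1):
        for dy in range(-half, half + 1):
            zz, yy = z + dz, y + dy
            if 0 <= zz < len(b_scan) and 0 <= yy < len(b_scan[0]):
                total += kernel[dz + half][dy + half] * b_scan[zz][yy]
    return abs(total)


def observation(b_scan, bank, y, depth_window):
    """Eight-dimensional vector: each filter's response summed over a depth window."""
    return [sum(response_at(b_scan, kernel, z, y) for z in depth_window)
            for kernel in bank]


bank = filter_bank()
# A 9 x 9 B-scan (depth x down-track) that varies only along the down-track axis.
stripes = [[math.cos(2.0 * math.pi * y / 4.0) for y in range(9)] for _ in range(9)]
obs = observation(stripes, bank, y=4, depth_window=[4])
```

In this toy example, the first filter, whose cosine varies along the down-track axis with the same period as the stripes, responds much more strongly than the filter rotated by 90 degrees, which is the orientation selectivity the feature relies on.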
5.3.3 Gradient features

This feature encodes the degree to which edges occur in the diagonal and anti-diagonal
directions within a B-scan [86]. First, we compute the first and second derivatives of the raw signal
$S(x, y, z)$ along the down-track ($y$) direction to remove the stationary effects and accentuate the
edges in the diagonal and antidiagonal directions using:

$$D_y(x, y, z) = \frac{S(x, y+2, z) + 2S(x, y+1, z) - 2S(x, y-1, z) - S(x, y-2, z)}{3}$$

$$D_{yy}(x, y, z) = \frac{D_y(x, y+2, z) + 2D_y(x, y+1, z) - 2D_y(x, y-1, z) - D_y(x, y-2, z)}{3}$$

Then, the derivative values are normalized using:

$$N(x, y, z) = \frac{D_{yy}(x, y, z) - \mu(x, z)}{\sigma(x, z)}$$

where $\mu(x, z)$ and $\sigma(x, z)$ are the running mean and standard deviation updated using a small
background area around the target flagged by the prescreener.

The down-track dimension is taken as the time variable in the HMM. The observation
vector at a point (x_s, y_s) consists of a set of 15 features that are computed on a normalized array
of GPR data of size 32 x 8. For a given x_s and y_s, let A = A(y, z) = N(x, y, z), where x = x_s,
y = y_s - 3, ..., y_s + 4, and z = 1, 2, ..., 32. The array A is then broken into positive and negative
parts according to the formulas:

    A^+(y, z) = \begin{cases} A(y, z) & \text{if } A(y, z) > 1 \\ 0 & \text{otherwise} \end{cases}

    A^-(y, z) = \begin{cases} -A(y, z) & \text{if } A(y, z) < -1 \\ 0 & \text{otherwise.} \end{cases}


Next, for each point in the positive and negative parts of A, the strengths of the diagonal
and anti-diagonal edges are estimated. The strengths are measured by taking the local minimum in
either the 45° or 135° direction around the column y_s + 1. Four types of edges are defined: the
positive anti-diagonal (PA), negative anti-diagonal (NA), positive diagonal (PD), and negative
diagonal (ND) edges. These edges are computed using

    PA(z) = \min\{A^+(y_s, z-1), A^+(y_s+1, z), A^+(y_s+2, z+1), A^+(y_s+3, z+2)\}

    NA(z) = \min\{A^-(y_s, z-1), A^-(y_s+1, z), A^-(y_s+2, z+1), A^-(y_s+3, z+2)\}

    PD(z) = \min\{A^+(y_s, z+2), A^+(y_s+1, z+1), A^+(y_s+2, z), A^+(y_s+3, z-1)\}

    ND(z) = \min\{A^-(y_s, z+2), A^-(y_s+1, z+1), A^-(y_s+2, z), A^-(y_s+3, z-1)\}
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For  each  edge  type,  we  find  the  position  of  the  maximum  value  over  a  neighborhood  of  32 
depth  values.  For  example,  in  the  array  PA  we  compute 

    m_{pa} = \arg\max \{PA(z) : z = 1, 2, \ldots, 32\}    (114)

where m_pa denotes "maximum of the positive anti-diagonal". The variables m_pd, m_na, and m_nd
are defined similarly. The values of the positive and negative diagonal and anti-diagonal arrays
are used to define the 4-dimensional (4-D) observation vector associated with the point (x_s, y_s),
O(x_s, y_s) = [PD(m_pd), PA(m_pa), ND(m_nd), NA(m_na)]. Observation sequences of length 15 are
formed at point (x, y) by extracting the observation sequence:

    O(x, y-7), O(x, y-6), \ldots, O(x, y-1), O(x, y), O(x, y+1), \ldots, O(x, y+7).
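
The edge-strength step can be illustrated with a minimal Python sketch. It assumes that the normalized array A is stored as a list of 8 columns (y_s - 3, ..., y_s + 4) of 32 depth values, uses 0-based indices, and restricts z so that z - 1 and z + 2 stay inside the array; the function names and the border handling are illustrative choices, not part of the original implementation.

```python
def split_parts(A, thr=1.0):
    """Split A[y][z] into positive and negative parts using threshold thr."""
    pos = [[v if v > thr else 0.0 for v in col] for col in A]
    neg = [[-v if v < -thr else 0.0 for v in col] for col in A]
    return pos, neg


def edge_strength(part, c, z, diagonal):
    """Minimum over 4 pixels on a 45 or 135 degree line starting at column c.

    Anti-diagonal lines visit depths z-1, z, z+1, z+2;
    diagonal lines visit depths z+2, z+1, z, z-1.
    """
    offsets = (2, 1, 0, -1) if diagonal else (-1, 0, 1, 2)
    return min(part[c + i][z + dz] for i, dz in enumerate(offsets))


def observation(A, c=3):
    """Return [PD(m_pd), PA(m_pa), ND(m_nd), NA(m_na)] for the column at index c."""
    pos, neg = split_parts(A)
    depths = range(1, len(A[0]) - 2)  # keeps z-1 and z+2 inside the array
    obs = []
    for part, diagonal in ((pos, True), (pos, False), (neg, True), (neg, False)):
        # Evaluating an edge array at its argmax is the same as taking its maximum.
        obs.append(max(edge_strength(part, c, z, diagonal) for z in depths))
    return obs
```

Sliding the column index over 15 consecutive down-track positions would then give one observation sequence of length 15.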


5.3.4  Bar  features 

The bar features were first used in the context of handwritten character recognition
[96]. Here, we adapt them to GPR data as follows. For each Ndepths-by-Nscans B-scan, the original
image is normalized and broken into positive and negative sub-images. Let S_x(z, y) be the xth plane
of the 3-D signature S(x, y, z). Then, each part of S_x is binarized using a constant threshold. Eight
feature images (H+(z,y), D+(z,y), V+(z,y), A+(z,y), H-(z,y), D-(z,y), V-(z,y), and A-(z,y))
are generated. Each feature image corresponds to one of the directions (horizontal, diagonal,
vertical, and anti-diagonal) in either the positive or the negative binarized image. Each feature image
has an integer value at each location that corresponds to the number of successive pixel values equal
to 1 (or -1 for the negative images) in that direction. These numbers are computed efficiently using
the two-pass procedure described in Algorithm 7. The feature image in each direction is obtained
by simple summation of the feature images for the positive and negative sub-images, using:

    H(z, y) = H^+(z, y) + H^-(z, y)
    D(z, y) = D^+(z, y) + D^-(z, y)
    V(z, y) = V^+(z, y) + V^-(z, y)
    A(z, y) = A^+(z, y) + A^-(z, y).

An example of an original GPR B-scan image and the derived bar feature images is shown
in figure 59. Each entry of the matrix H, for example, is given a value equal to the length of the
longest horizontal bar that fits around that entry. Thus, the bar feature matrices H, D,
V, and A measure the strength of the horizontal, diagonal, vertical, and anti-diagonal edge at each
point of the original GPR image.
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Algorithm  7  Pseudo-code  for  computing  the  bar  features 

Require:  2-D  matrix  of  binarized  GPR  image  (Ndepths  x  Nscans). 

Ensure: 

{FORWARD PASS}
for z = 2, ..., Ndepths - 1 do
    for y = 2, ..., Nscans - 1 do
        H(z,y) = H(z,y-1) + 1
        D(z,y) = D(z-1,y+1) + 1
        V(z,y) = V(z-1,y) + 1
        A(z,y) = A(z-1,y-1) + 1
    end for
end for
{BACKWARD PASS}
for z = Ndepths - 1, ..., 2 do
    for y = Nscans - 1, ..., 2 do
        H(z,y) = max(H(z,y), H(z,y+1))
        D(z,y) = max(D(z,y), D(z+1,y-1))
        V(z,y) = max(V(z,y), V(z+1,y))
        A(z,y) = max(A(z,y), A(z+1,y+1))
    end for
end for
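
A runnable Python sketch of the two-pass procedure is given below for illustration. It makes two assumptions that are not spelled out in the listing: the updates apply only to foreground pixels (so that a run is reset by a background pixel), and the image border is handled by treating out-of-range neighbors as zero instead of skipping the first and last rows and columns. The names `bar_features` and `get` are illustrative.

```python
def get(M, z, y):
    """Value of M at (z, y), or 0 when the position is outside the image."""
    return M[z][y] if 0 <= z < len(M) and 0 <= y < len(M[0]) else 0


def bar_features(img):
    """Run lengths through each foreground pixel of a binarized image.

    img is a list of rows (depth z) of 0/1 values (columns are scans y).
    Returns a dict with one matrix per direction: H, D, V, A.
    """
    nz, ny = len(img), len(img[0])
    # Predecessor offsets (dz, dy) used by the forward pass of Algorithm 7.
    steps = {"H": (0, -1), "D": (-1, 1), "V": (-1, 0), "A": (-1, -1)}
    feats = {}
    for name, (dz, dy) in steps.items():
        F = [[0] * ny for _ in range(nz)]
        # Forward pass: length of the run that ends at (z, y).
        for z in range(nz):
            for y in range(ny):
                if img[z][y]:
                    F[z][y] = get(F, z + dz, y + dy) + 1
        # Backward pass: spread the full run length back along the run.
        for z in reversed(range(nz)):
            for y in reversed(range(ny)):
                if img[z][y]:
                    F[z][y] = max(F[z][y], get(F, z - dz, y - dy))
        feats[name] = F
    return feats
```

For the negative sub-image, the same function can be applied after mapping the -1 entries to 1.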
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Figure  59.  A  mine  B-scan  image  and  the  bar  features  matrices  for  the  horizontal  (H),  diagonal  (D), 
vertical  (V),  and  anti-diagonal  (A)  directions 
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To  derive  meaningful  features  for  HMM  modeling,  the  feature  images  are  combined  into  a 
5-bin histogram: one bin for each direction and a fifth bin for the non-edge direction. In particular, if
the maximum of the bar matrices exceeds a certain preset threshold θ_b, the corresponding pixel is
considered to be an edge pixel in the corresponding direction. Otherwise, it is considered a non-edge
pixel. For a given channel x ∈ {1, ..., N_c}, the corresponding histogram, H^{(x)}_{zy}, is computed from
the feature matrices {H(z, y), D(z, y), V(z, y), A(z, y)}. The overall sequence of observation vectors
computed from the 3-D signature S(x, y, z) is then:

    O(x, y) = [\bar{H}_{zy_1}, \bar{H}_{zy_2}, \ldots, \bar{H}_{zy_T}],    (115)

where \bar{H}_{zy_i} is the cross-track average, over the N_c channels, of the bar feature histograms:

    \bar{H}_{zy_i} = \frac{1}{N_c} \sum_{x=1}^{N_c} H^{(x)}_{zy_i}.    (116)
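
The histogram construction and the cross-track average of equation (116) can be sketched as follows. This is only an illustration: the input layout (a dict of four matrices indexed as `[z][y]`), the bin order (H, D, V, A, non-edge), the use of raw counts, and the choice of accumulating over all depth rows of a column are assumptions, since the depth window used for each histogram is not restated here.

```python
def column_histogram(feats, y, theta_b):
    """5-bin histogram [H, D, V, A, non-edge] for scan position y.

    feats maps "H", "D", "V", "A" to matrices indexed as [z][y].
    A pixel votes for its strongest direction when that strength
    exceeds theta_b, and for the non-edge bin otherwise.
    """
    names = ("H", "D", "V", "A")
    hist = [0] * 5
    for z in range(len(feats["H"])):
        values = [feats[n][z][y] for n in names]
        best = max(range(4), key=lambda i: values[i])
        hist[best if values[best] > theta_b else 4] += 1
    return hist


def cross_track_average(histograms):
    """Element-wise mean of the per-channel histograms, as in equation (116)."""
    n = len(histograms)
    return [sum(h[i] for h in histograms) / n for i in range(5)]
```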

5.4  Baseline  HMM  detector 

The baseline HMM classifier consists of two HMMs, one for mines and one for background.
Each model has four states and produces a probability value by backtracking through the model
states using the Viterbi algorithm [8]. The mine model, λ^m, is designed to capture the hyperbolic
spatial distribution of the features. It uses observation vectors that encode the degree to which edges
occur in the horizontal (H), vertical (V), diagonal (D), anti-diagonal (A), and non-edge (N)
directions. This model assumes that mine signatures have a hyperbolic shape comprised of a succession
of rising, horizontal, and falling edges with variable duration in each state. The beginning
and the end of the observation sequence correspond to a non-edge (or background) state.

The 4-state mine model is illustrated in figure 60. In this case, the mine model is a cyclic
model with the following architecture constraints:

• The system always starts at state 1;

• When in state i, the system can move only to state i or state i+1, i = 1, ..., N-1;

• From state N, the system can move to state N or state 1.
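
These constraints amount to a sparsity pattern on the initial state distribution and on the transition matrix. The sketch below builds such a pattern in Python with 0-based state indices; the function names and the initial self-transition probability `stay` are illustrative assumptions. Because Baum-Welch re-estimation leaves zero transition probabilities at zero, initializing the model this way keeps the cyclic structure during training.

```python
def cyclic_transition_mask(n_states):
    """allowed[i][j] is True when the move from state i to state j is permitted."""
    allowed = [[False] * n_states for _ in range(n_states)]
    for i in range(n_states):
        allowed[i][i] = True                   # stay in the same state
        allowed[i][(i + 1) % n_states] = True  # advance; the last state wraps to the first
    return allowed


def initial_model(n_states, stay=0.5):
    """Initial distribution and transition matrix that respect the constraints."""
    mask = cyclic_transition_mask(n_states)
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        A[i][i] = stay
        A[i][(i + 1) % n_states] = 1.0 - stay
    pi = [1.0] + [0.0] * (n_states - 1)        # the system always starts at the first state
    return pi, A, mask
```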

The background model, λ^b, is needed to capture the background and clutter characteristics.
No prior information or assumptions are used in this model.

The  probability  value  produced  by  the  mine  (background)  model  can  be  thought  of  as  an 
estimate  of  the  probability  of  the  observation  sequence  given  that  there  is  a  mine  (background) 
present. The model architecture is illustrated in figure 61.


Figure  60.  Illustration  of  the  baseline  HMM  mine  model  with  four  states. 


Figure  61.  Illustration  of  the  baseline  HMM  classifier  architecture 

To generate the codebook, we cluster the training data corresponding to each class (mine
or background) into M clusters using the k-means algorithm [23]. The initial model parameters
are set according to the distance of the M codebook vectors to each of the N states, using equation (69).
The transition probabilities A and the observation probabilities B are then estimated using the
Baum-Welch algorithm [9], the MCE/GPD algorithm [11], or a combination of the two.

The confidence value assigned to each observation sequence, Conf(O), depends on: (1) the
probability assigned by the mine model, Pr(O|λ^m); (2) the probability assigned by the background
model, Pr(O|λ^b); and (3) the optimal state sequence. In particular, we use:

    Conf(O) = \begin{cases} \max\left(\log\frac{\Pr(O|\lambda^m)}{\Pr(O|\lambda^b)}, 0\right) & \text{if } \#\{s_t = 1, t = 1, \ldots, T\} < T_{max} \\ 0 & \text{otherwise} \end{cases}    (117)

where #{s_t = 1, t = 1, ..., T} corresponds to the number of observations assigned to the background
state (state 1). T_max is defined experimentally based on the shortest mine signature. Equation (117)
ensures that sequences with a large number of observations assigned to state 1 are considered
non-mines.
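
As a minimal sketch, equation (117) can be evaluated from the two log-probabilities and the Viterbi state path as follows. The function and argument names are illustrative, and working with log-probabilities (so that the log-ratio becomes a difference) is an implementation choice assumed here.

```python
def confidence(log_p_mine, log_p_background, state_path, t_max, background_state=1):
    """Confidence of an observation sequence following equation (117).

    log_p_mine and log_p_background are log Pr(O | mine model) and
    log Pr(O | background model); state_path is the optimal state sequence
    of the mine model.
    """
    n_background = sum(1 for s in state_path if s == background_state)
    if n_background < t_max:
        # Log-likelihood ratio, clipped at zero.
        return max(log_p_mine - log_p_background, 0.0)
    # Too many observations in the background state: treat as a non-mine.
    return 0.0
```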


5.5  Ensemble  HMM  landmine  detector 

Mine  signatures  vary  according  to  the  mine  type,  mine  size,  and  burial  depth.  Similarly, 
clutter  signatures  vary  with  soil  type,  site  moisture,  and  other  noise  and  environmental  factors. 
The  eHMM  approach  attempts  to  learn  these  variations  by  constructing  an  ensemble  of  HMMs  that 
captures  the  characteristics  and  variability  within  the  training  data.  In  the  following,  we  assume 
that  the  training  data  consists  of  a  set  of  R  sequences  of  length  T. 

The architecture of the eHMM landmine classifier is shown in figure 62. It has five main
components: feature extraction, similarity matrix computation, similarity-based clustering,
construction of the clusters' models, and decision level fusion. In the first step, the EHD, Gabor,
gradient, and bar features are extracted as detailed in section 5.3. In the second step, each of the R
sequences is used to learn one HMM. The model parameters are initialized and subsequently adjusted,
using the Baum-Welch algorithm, to fit the corresponding sequence. The third step consists of
building a pairwise similarity matrix based on the log-likelihood score and the Viterbi path mismatch
penalty that result from testing each sequence in each model. The mixing factor α in (74) is set
empirically to 0.1. In the fourth step, we use a standard hierarchical clustering algorithm to partition
the data into groups (clusters) of similar signatures. Then, for each cluster, an HMM is
initialized and trained using the signatures belonging to that cluster. The devised training scheme
depends on the cluster homogeneity and size. Finally, in the fifth step, we use an HME or an ANN to
combine the outputs of each model and assign a single confidence value to a given test sequence.
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Figure  62.  Block  diagram  of  the  proposed  eHMM  landmine  classifier  (training) 


5.6  Experimental  results 
5.6.1  Dataset  collections 

The proposed eHMM was developed and tested on GPR datasets collected with a NIITEK
vehicle-mounted GPR system [3] (see figure 52). The datasets comprise a variety of mine and
background signatures. We use data collected from outdoor test lanes at three different locations.
The first two locations, site 1 and site 2, were temperate regions with significant rainfall, whereas
the third location, site 3, was a desert region. The lanes are simulated roads with known mine
locations. Lanes at site 1 are labeled lanes 1, 3, and 4, and are 500 meters long and 3 meters wide.
Lanes at site 2 are labeled lanes 3, 4, 13, 14, and 19, and are 50 to 250 meters long and 3 meters
wide. Lanes at site 3 are labeled lanes 51 and 52, and are 300 meters long and 3 meters wide.
Multiple data collections were performed at each site on different dates. Statistics of these datasets
are reported in table 24. In the first dataset collection, D_1, the LMS prescreener identified a
total of 1843 alarms, 613 of which are mine encounters. The second and third dataset collections,
D_2 and D_3, are obtained from the same GPR raw data, but using different preprocessing and
prescreening methods. Based on the first preprocessing method, the prescreener used for D_2 identifies
724 mine encounters and 619 false alarms. With the second preprocessing method, used for D_3, the
prescreener identifies 732 mine encounters, i.e. eight more mines than D_2, at the expense of 1126
additional false alarms.


TABLE 24
Data collections

Dataset    Total pre-screener alarms    Mine encounters    False alarms
D_1        1843                         613                1230
D_2        1343                         724                619
D_3        2477                         732                1745

5.6.2  Data  preprocessing  and  features  extraction 

5.6.2.1 Data preprocessing

For  each  dataset  collection,  the  raw  GPR  data  is  preprocessed  and  prescreened,  as  detailed 
in  section  5.2,  to  identify  potential  locations  of  interest  within  the  GPR  signals.  Those  locations, 
also  called  alarms,  can  be  mine  targets  or  naturally  occurring  clutter.  The  prescreener  alarms,  in 
the  form  of  a  3-D  GPR  volume  of  data,  are  then  used  to  extract  the  various  features  as  described 
in  section  5.3. 

Each alarm has over 400 depth values, 51 channel samplings, and 61 scans. Experimental
results show that the original data can be reduced without affecting the overall performance. Thus,
we down-sample the GPR data along depth (by a factor of 4) and process only 7 channels
centered around the channel with maximum prescreener confidence. The prescreening step results
in alarms that are centered around the middle scan position. Therefore, we use only the middle
15 scans (24, ..., 38) of the original prescreener alarm. The first and last few scans are typically
used for preprocessing (background variance estimation, background subtraction, etc.). It is worth
noting that the ground-truth depths of mine signatures are either manually identified or previously
estimated by the prescreener. Those ground-truth depths are used only in the training step.
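
The data reduction described above can be expressed as an index selection. The Python sketch below uses 0-based indices, takes 400 depth values as an example (the text only states "over 400"), and shifts the 7-channel window inward when the strongest channel lies near the edge of the array; the function name and the edge handling are assumptions made for illustration.

```python
def reduction_indices(n_depths, n_channels, n_scans, best_channel,
                      depth_step=4, n_keep_channels=7, n_keep_scans=15):
    """Indices (0-based) of the depths, channels, and scans that are kept."""
    # Down-sample along depth by keeping every depth_step-th value.
    depths = list(range(0, n_depths, depth_step))
    # Keep a window of channels centered on the strongest prescreener channel,
    # shifted inward when the window would leave the array.
    first_c = best_channel - n_keep_channels // 2
    first_c = min(max(first_c, 0), n_channels - n_keep_channels)
    channels = list(range(first_c, first_c + n_keep_channels))
    # Keep the middle scans of the alarm.
    first_s = (n_scans - n_keep_scans) // 2
    scans = list(range(first_s, first_s + n_keep_scans))
    return depths, channels, scans
```

With 61 scans, the kept scan indices are 23 to 37 in 0-based numbering, which corresponds to scans 24 to 38 in the 1-based numbering used in the text.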

5.6.2.2 Features extraction

The previous step results in a smaller 7 x 15 x 100 GPR volume. Each
of the feature extraction methods described in section 5.3 is then used to extract a feature vector at
each down-sampled depth. Depending on the feature extraction method, averaging over the seven
channels is done within the extraction procedure (for EHD) or post-extraction (for the Gabor, bar,
and gradient features).

Given the down-sampled volume of GPR data obtained in the previous step, we apply each of
the extraction methods to compute the sequence of feature vectors at each depth. Each extraction
method results in a map of 15 sequences of 4- or 5-dimensional feature vectors at each down-sampled
depth. Let O_l^{EHD}, O_l^{GBR}, O_l^{GRA}, and O_l^{BAR} denote the observation vector maps extracted from
dataset D_l, l ∈ {1, 2, 3}, using the EHD, Gabor, gradient, and bar features, respectively.

In figure 63, we show the observation vectors extracted from three signatures. The first (resp.
second) signature corresponds to a mine with high (resp. low) metal content taken at the ground-truth
depth and at the center channel identified by the prescreener. The third signature corresponds to
a subwindow of an alarm taken at a random depth with no significant GPR signature activity.
Note that the EHD and Gabor features are normalized, while the gradient and bar feature values
have wider ranges. The extracted feature vectors cover different aspects of the GPR signature. For
instance, Gabor features are less sensitive to the image intensity and vary mainly with the size
of the target and the texture of the signature.


[Figure 63 panels: EHD features, Gabor features, and bar histogram features, each plotted against observation positions 1 to 15]


Figure  63.  EHD,  Gabor,  gradient,  and  bar  features  of  two  sample  mines  and  a  background  signature 
In figure 64, we show the means of the extracted feature vector observations of all the
mines (first column subfigures) and clutter (second column subfigures) of D_1. For each feature
Feat ∈ {EHD, GBR, GRA, BAR}, we compute \bar{O}_1^{Feat} = average{O : O ∈ O_1^{Feat}} for mines and
clutter signatures separately. Then, the 5- or 4-dimensional mean vector of the EHD (row 1), Gabor
(row 2), gradient (row 3), and bar features (row 4 subfigures) is plotted at each observation position
t, 1 ≤ t ≤ T. These feature-dependent global statistics could be used to set the thresholds of the
proposed algorithms, as outlined in the next section.

Figure 64. Average feature observation vectors of mines (column 1) and clutter (column 2) using
the EHD (row 1), Gabor (row 2), gradient (row 3), and bar (row 4) extraction methods

5.6.2.3 Identifying actual length of mine sequences


In order to satisfy the eHMM architecture constraints, we use equation (73) with the
feature-dependent threshold τ to identify the actual length of the mine sequences. From figure 64(a),
it can be seen that choosing τ = 0.18 for the EHD features would result in discarding a few
mine observations with a weak diagonal component from the start of the sequence and a few mine
observations with a weak anti-diagonal component at the end of the sequence. Similarly, for the other
features and referring to figures 64(c), (e), and (g), respectively, we fix τ = 0.7 for the Gabor features,
τ = 0.3 for the gradient features, and τ = 0.11 for the bar features.

5.6.3  Experimental  setup 

In all experiments reported in this chapter, we use 6-fold cross-validation for each dataset
D_l, l ∈ {1, 2, 3}. For each fold, a subset of the data (D_l^{Trn}) is used for training and the remaining
data (D_l^{Tst}) is used for testing. Let R_l^n be the size of D_l^{Trn} and Q_l^n = N_l - R_l^n be the size of D_l^{Tst}
for the nth fold. Finally, let O_{l,Trn}^{Feat} denote the observation vector maps extracted from dataset D_l,
l ∈ {1, 2, 3}, using the corresponding feature extraction method Feat ∈ {EHD, GBR, GRA, BAR}
(the Feat labels 'EHD', 'GBR', 'GRA', and 'BAR' correspond to the EHD, Gabor, gradient, and bar
features, respectively).

A comparison of the classifiers and the feature-based models should be based on the ROC
curves and should use the same cross-validation settings. Therefore, we use a unified cross-validation
experimental setup. For each dataset, we have the same cross-validation partition and the same
setting for the different feature extraction methods. In particular, the EHD, Gabor, gradient, and
bar features are pre-computed and saved for each dataset and are used by all classifiers, namely
the baseline DHMM, the baseline CHMM, the eDHMM, and the eCHMM. In the following results
sections, we provide the implementation details of the eCHMM and eDHMM on one of the datasets
D_l and using one of the feature extraction methods. We illustrate some of the steps using
D_3 and the eCHMM with bar features. In the last results section, we report the comprehensive
results, in terms of area under the ROC curve (AUC), of the eHMM vs. the baseline HMM: (1) in the
continuous vs. discrete case; (2) using EHD, Gabor, gradient, and bar features; and (3) using D_1,
D_2, and D_3.
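
One simple way to obtain a partition that is shared by all features and classifiers is to fix it once, by alarm index, with a seeded shuffle. The Python sketch below illustrates this idea; the random assignment of alarms to folds, the seed, and the function name are assumptions, since the text does not state how the folds were drawn.

```python
import random


def six_fold_partition(n_alarms, n_folds=6, seed=0):
    """Return a list of (training_indices, testing_indices), one pair per fold."""
    indices = list(range(n_alarms))
    random.Random(seed).shuffle(indices)     # fixed seed: same partition on every call
    folds = [indices[k::n_folds] for k in range(n_folds)]
    partition = []
    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        partition.append((train, test))
    return partition
```

Because the partition depends only on the number of alarms and the seed, the pre-computed EHD, Gabor, gradient, and bar features of a dataset can all be split with the same index lists.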
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(b) 


Figure 65. eCHMM hierarchical clustering of O_{3,Trn_1}^{BAR}: (a) pairwise-distance matrix, (b) dendrogram


5.6.4  eHMM  construction  and  training  steps 

For each dataset D_l, l ∈ {1, 2, 3}, and each feature extraction method, Feat ∈ {EHD, GBR, GRA, BAR},
we have a 6-fold cross-validation partition O_l^{Feat} = {O_{l,Trn_n}^{Feat}, O_{l,Tst_n}^{Feat}}, n = 1, ..., 6, of
observation maps. Each alarm in the original GPR data is represented by a T x D sequence of feature
vector observations at each down-sampled depth location.

The first step of the eHMM is the similarity matrix computation. This step requires fitting
an individual HMM for each sequence in the training data O_{l,Trn}^{Feat}. Details of the individual
DHMM construction and the similarity matrix computation of the first cross-validation fold of dataset
D_3 using the EHD features, O_{3,Trn_1}^{EHD}, were provided in the running example of chapter 3.

In the second step, the similarity matrix is transformed into a distance matrix, D, using (77).

The hierarchical clustering algorithm [24] is then applied, using D with a fixed number of clusters
K = 10, to partition the training data. For both the discrete and continuous versions, using any
of the features and datasets, the eHMM clustering step successfully assigns groups of similar alarms
to clusters. For instance, in figures 65 and 66, we show the hierarchical clustering results of the
first cross-validation fold of the eCHMM using bar features on D_3. As can be seen in figure 66(a)
and in the dendrogram of figure 65(b), we have a group of clutter-dominated clusters (in red) and a
second group of clusters dominated by mines (in cyan). Additional details of the clusters' contents
per mine type and per burial depth are shown in figures 66(b) and (c).
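
For illustration, a small agglomerative clustering routine that works directly on a pairwise distance matrix and stops at K clusters is sketched below in Python. Average linkage is an assumption (the linkage criterion is not restated in this section), and this naive version recomputes all cluster-to-cluster distances at every merge, so it is only meant for small examples.

```python
def agglomerative(D, k):
    """Average-linkage agglomerative clustering of items 0..n-1 into k clusters.

    D is a symmetric pairwise distance matrix given as a list of lists.
    Returns a list of clusters, each a list of item indices.
    """
    clusters = [[i] for i in range(len(D))]

    def linkage(a, b):
        # Mean distance between all cross-cluster pairs.
        return sum(D[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        # Find the closest pair of clusters and merge it.
        _, p, q = min((linkage(clusters[p], clusters[q]), p, q)
                      for p in range(len(clusters))
                      for q in range(p + 1, len(clusters)))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters
```

Calling `agglomerative(D, 10)` on the distance matrix of a training fold would return ten groups of alarm indices, one per cluster model.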

In the third step, we build the eHMM mixture models for the clusters obtained from the
previous step. The eHMM is built based on the clusters' sizes and homogeneity. Then, the
context-dependent training scheme, described in section 3.3.3, is used to initialize and learn the eHMM.

Continuing with the example of the eCHMM with bar features from the previous step,
we can see that the partition of the training data includes four homogeneous clusters (clusters 6, 7,
and 10 contain only mines and cluster 1 has only clutter). The remaining clusters (2, 3, 4, 5, 8, and
9) are mixed. Therefore, using the notation of Algorithm 6, we define our eHMM as:

    \Lambda^{BW} = \{\lambda_1^{(F)}, \lambda_6^{(M)}, \lambda_7^{(M)}, \lambda_{10}^{(M)}\},
    \Lambda^{MCE} = \{\lambda_j^{(M)}, \lambda_j^{(F)}\}, \; j \in \{2, 3, 4, 5, 8, 9\},    (118)
    \Lambda^{VB} = \emptyset.

Figure 66. eCHMM hierarchical clustering results of O_{3,Trn_1}^{BAR}: distribution of the alarms in each
cluster: (a) per class (mines vs. clutter), (b) per type (low metal vs. high metal), (c) per burial depth.

Table 25 shows the BW-trained eCHMM model for cluster 6, λ_6^{(M)}. Recall that cluster 6
contains only mine signatures that have a low metal content and/or are buried at 2" or deeper, as
can be seen in figure 66. The alarms in cluster 6 are therefore expected to have weak signatures
and, consequently, weak edge features. This could explain the large non-edge component of most of the
state component means of λ_6^{(M)} reported in table 25. Moreover, both states s_1 and s_3 have two
dominant components (g_11, g_12 for s_1; g_32, g_33 for s_3) and one weak component. In both cases,
the dominant components are the ones with the largest non-edge attribute. Nevertheless, the state
representatives still characterize the hyperbolic shape of a typical mine signature, i.e. the succession
of Dg-Hz-Ad states. For instance, all s_1 component means have their diagonal (D) dimension
larger than the anti-diagonal (A) dimension.

In the final step, the eHMM mixture constructed previously is combined using an ANN or an
HME. The ANN and HME parameters are trained to fit the responses of the eHMM mixture models
to the training data labels. For each cross-validation fold, the initial and trained eHMM models, the
trained ANN, and the trained HME are saved and will be used to test the respective subset.

TABLE 25
λ_6^{(M)} CHMM model parameters of cluster 6

Transition matrix A
           s1      s2      s3
    s1     0.75    0.25    0.00
    s2     0.00    0.81    0.19
    s3     0.00    0.00    1.00

Priors π
           s1      s2      s3
           1.00    0.00    0.00

Means
    State  Component   H       D       V       A       N
    s1     μ_11        0.20    0.21    0.16    0.05    0.82
    s1     μ_12        0.47    0.31    0.26    0.15    0.61
    s1     μ_13        0.60    0.27    0.34    0.20    0.36
    s2     μ_21        0.55    0.24    0.25    0.31    0.57
    s2     μ_22        0.56    0.39    0.44    0.36    0.37
    s2     μ_23        0.67    0.30    0.38    0.43    0.27
    s3     μ_31        0.52    0.14    0.32    0.36    0.41
    s3     μ_32        0.15    0.04    0.11    0.16    0.86
    s3     μ_33        0.36    0.09    0.23    0.31    0.66

Covariance coefficients and mixture weights W
    State  Component   H       D       V       A       N       Weight   W
    s1     COV_11      0.01    0.01    0.01    0.01    0.01    g_11     0.42
    s1     COV_12      0.01    0.01    0.01    0.01    0.01    g_12     0.48
    s1     COV_13      0.01    0.01    0.01    0.01    0.01    g_13     0.10
    s2     COV_21      0.01    0.01    0.01    0.01    0.01    g_21     0.46
    s2     COV_22      0.01    0.01    0.01    0.01    0.02    g_22     0.21
    s2     COV_23      0.01    0.01    0.01    0.01    0.01    g_23     0.33
    s3     COV_31      0.01    0.01    0.01    0.01    0.01    g_31     0.10
    s3     COV_32      0.01    0.01    0.01    0.01    0.01    g_32     0.50
    s3     COV_33      0.01    0.01    0.01    0.01    0.01    g_33     0.40


5.6.5  eHMM  testing 

The block diagram for the proposed eHMM in testing mode was given in figure 22. In the
training step, for each cross-validation fold, we used the O_{l,Trn}^{Feat} subset to learn the eHMM mixture and
the decision level fusion model. In this testing step, we use the eHMM models and the decision level
component of the eHMM, obtained from the training step, to assign a confidence to a new sequence
from the corresponding cross-validation testing subset O_{l,Tst}^{Feat}. For each dataset D_l and experimental
setting (discrete vs. continuous, and feature extraction method), we repeat the previous step for
each cross-validation fold. The results for a particular experimental setting are summarized in terms
of the Receiver Operating Characteristic (ROC) and the Area Under the ROC Curve (AUC). In this section,
we investigate the performance of the HME and ANN fusion methods for the experimental setting
example provided in the eHMM training step. Then, we motivate our choice of the clustering
method and the number of clusters in the case of D_3 using the EHD-based eDHMM. Finally, we
provide comparative results of the eHMM vs. the baseline HMM on all datasets, using the discrete
and continuous HMM, and the different feature extraction methods.

Figure 67. ROCs generated by the HME fusion, ANN fusion, and a few cluster models using (a) the
training data O_{3,Trn_1}^{BAR} and (b) the testing data O_{3,Tst_1}^{BAR}
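
As a point of reference for the AUC summary, the area under the full ROC curve can be computed directly from the confidence values and the ground-truth labels with the rank (Mann-Whitney) formulation, as sketched below in Python. The function name is illustrative, and the sketch assumes the standard, unnormalized AUC over the whole false alarm range; the scoring used in the reported experiments may differ in such details.

```python
def auc(confidences, labels):
    """Area under the ROC curve from confidences and 0/1 labels (1 = mine).

    Equals the fraction of (mine, clutter) pairs in which the mine receives
    the higher confidence; tied pairs count one half.
    """
    pos = [c for c, l in zip(confidences, labels) if l == 1]
    neg = [c for c, l in zip(confidences, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Pooling the test confidences of the six folds before calling `auc` gives a single value per experimental setting.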


5.6.5.1 Comparison of the HME and the ANN fusion results

Continuing with the example used in the eHMM training step, where we have shown the
hierarchical clustering results of the eCHMM using bar features on O_{3,Trn_1}^{BAR}, we reported that the
partition of the training data includes four homogeneous clusters and six mixed clusters. In figure
67(a), we show the ROCs generated by the models of one mine-dominated cluster (cluster 6), one
clutter-dominated cluster (cluster 1), and one mixed cluster (cluster 2) on the training data O_{3,Trn_1}^{BAR}. We
also show the ROCs of the ANN and HME fusion that were obtained by combining the confidences
of all ten cluster models. In figure 67(b), we show the ROCs generated by the same classifiers on the
corresponding test data O_{3,Tst_1}^{BAR}. As can be seen from the ROCs of figure 67, the HME and the ANN
improve the classification performance over the individual clusters' models, with the HME fusion
outperforming the ANN fusion, especially on the training data. In the remainder of our experiments,
we use the HME as the default fusion method.


5.6.5.2 Comparison of the clustering methods

In our experiments, we noticed that the clustering methods used, namely the hierarchical
agglomerative, the FLeCK, and the spectral FCM-MK, yield comparable partitions in terms of
cluster purity (i.e. mine and clutter distributions in each cluster). Therefore, the overall eHMM
performance, using any of the clustering algorithms, is comparable in terms of the ROC or the AUC
measure. For instance, in figure 68, we show the ROCs generated by the eDHMM on O_3^{EHD} using
the hierarchical, FLeCK, and FCM-MK algorithms with K = 10. As can be seen, the performance
of the eDHMM using any of the clustering methods is comparable. It is worth noting that the FLeCK
and FCM-MK algorithms are parameter-based and may have inconsistent results. That is, for the
FLeCK algorithm, the results depend on the initial FCM partition. For the FCM-MK, it is not
guaranteed that the clustering results in exactly K clusters, as the final partition may have empty
clusters. Therefore, we will use the hierarchical agglomerative clustering algorithm in the remainder
of our experiments, since it is non-parametric and gives consistent results.

Figure 69. Optimization of the number of clusters
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5. 6. 5. 3  Choice  of  the  number  of  clusters  K 


To optimize the number of clusters, a performance criterion such as the partition purity,
the mean of the inter- and intra-cluster distances, or the overall performance of the eHMM can be
used. In our case, we use the eHMM's AUC as the criterion for choosing the optimal number of clusters.
To this end, we conducted an empirical study by varying the number of clusters (K = 1, ..., 50)
and measuring the AUC of the corresponding eHMM for each value of K. In particular,
we used the EHD-based eDHMM with hierarchical clustering and HME fusion on D3 to
illustrate the optimization of the number of clusters. In figure 69, we show the evolution of the
eDHMM's AUC with the number of clusters used in the eDHMM hierarchical clustering component.
The trivial choice of K = 1 is equivalent to the baseline DHMM and has the lowest AUC of 643,
as can be seen in figure 69. For K = 2, the eDHMM comprises two clusters: one cluster
contains the mine signatures and the mine-like clutter signatures, and the other cluster contains the
clutter signatures that are dominantly background, with no significant GPR activity. Apart from the
trivial cases of K ∈ {1, 2}, the AUC of the eDHMM increases with the number of clusters until it
reaches a plateau around K = 20. Therefore, we pick K = 20 as a reasonable tradeoff between the eDHMM
performance and its complexity in terms of the number of clusters (mixture models).
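
The plateau rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure used in our experiments: the function name pick_k, the relative tolerance tol that defines the plateau, and the AUC values in the usage lines are all hypothetical.

```python
def pick_k(auc_by_k, tol=0.01):
    """Return the smallest K whose AUC is within `tol` (relative) of the best AUC.

    auc_by_k: dict mapping a number of clusters K to the AUC measured for it.
    tol: hypothetical relative tolerance that defines the "plateau".
    """
    if not auc_by_k:
        raise ValueError("auc_by_k must not be empty")
    best = max(auc_by_k.values())
    threshold = best * (1.0 - tol)
    # Scan K in increasing order so that the least complex model is preferred.
    for k in sorted(auc_by_k):
        if auc_by_k[k] >= threshold:
            return k


# Made-up AUC values, for illustration only (not results from this work).
example = {1: 0.60, 2: 0.62, 5: 0.66, 10: 0.69, 20: 0.71, 30: 0.712, 50: 0.713}
chosen = pick_k(example, tol=0.01)
```

A larger tolerance favors fewer clusters, while a tolerance of zero simply returns the K with the highest AUC.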

5.6.5.4 eHMM vs. baseline HMM results

In this section, we highlight the relative performance of the eHMM compared to the baseline
HMM. For the eHMM, we show the results using the HME fusion and the hierarchical agglomerative
clustering with K = 20. In figures 70(a)-(d), we show the ROCs generated by the discrete versions of
the eHMM and the baseline HMM on dataset D3, using the EHD, Gabor, gradient, and bar features,
respectively. Similarly, in figures 71(a)-(d), we report the ROCs generated
by the continuous versions, i.e., the eCHMM and the baseline CHMM, on dataset D3, using the
EHD, Gabor, gradient, and bar features, respectively. In all the experimental settings using D3, the
ensemble HMM method outperforms the baseline HMM. In all the ROCs of figures 70 and 71, at a
given false alarm rate (FAR), the eHMM has a higher probability of detecting targets. For instance,
in figure 70(a), at a FAR of 10%, the eDHMM using EHD features successfully identifies 94% of the
mines while the baseline DHMM identifies only 87% of the targets. At the same FAR of 10%, the
ROCs of figure 71(a) show that the eCHMM successfully identifies 95% of the mines while the baseline
CHMM probability of detection is 85%.
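
As an illustration of how an operating point such as "probability of detection at a FAR of 10%" can be read from a set of confidences, consider the minimal sketch below. It assumes that the FAR is expressed as a fraction of the non-target signatures and that higher confidences are more target-like. The function name pd_at_far and the sample confidences are hypothetical and are not taken from our data.

```python
def pd_at_far(confidences, labels, far=0.10):
    """Probability of detection at a given false alarm rate.

    confidences: classifier outputs, higher means more target-like.
    labels: 1 for a target (mine), 0 for a non-target (clutter).
    far: allowed fraction of non-targets declared as targets.
    """
    targets = [c for c, y in zip(confidences, labels) if y == 1]
    clutter = sorted((c for c, y in zip(confidences, labels) if y == 0), reverse=True)
    if not targets or not clutter:
        raise ValueError("need at least one target and one non-target")
    # Number of false alarms that may be accepted at this FAR.
    allowed = int(far * len(clutter))
    # An alarm is declared when the confidence is strictly above the threshold.
    # Taking the (allowed + 1)-th highest clutter confidence as the threshold
    # keeps the number of false alarms at or below `allowed`.
    threshold = clutter[allowed] if allowed < len(clutter) else float("-inf")
    detected = sum(1 for c in targets if c > threshold)
    return detected / len(targets)


# Made-up confidences: four targets followed by ten clutter signatures.
conf = [0.92, 0.85, 0.99, 0.97, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
lab = [1] * 4 + [0] * 10
pd = pd_at_far(conf, lab, far=0.10)
```

Comparing two classifiers at the same FAR then amounts to calling the function once per classifier on the same labeled signatures.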

Figure 70. ROCs generated by the eDHMM (solid lines) and baseline DHMM (dashed lines) classifiers
using D3 and (a) EHD, (b) Gabor, (c) gradient, and (d) bar features

Figure 71. ROCs generated by the eCHMM (solid lines) and baseline CHMM (dashed lines) classifiers
using D3 and (a) EHD, (b) Gabor, (c) gradient, and (d) bar features

To gain more insight into the relative performance of the eHMM and the baseline HMM
classifiers on D3, we plot the confidences assigned by the eCHMM vs. the baseline
CHMM to the signatures of D3Tst in the scatter plot of figure 72. The scatter plot shows that the
eCHMM and CHMM classifiers are positively correlated, as most of the points fall along the diagonal
of the plot. Both classifiers assign low confidences to clutter and higher confidences to mine
signatures, especially high-metal (in red and purple) and shallow signatures (in red and black). The
improvement of the eHMM over the baseline classifier can be seen particularly in the areas R1 and R2.
In fact, the eCHMM assigns higher confidences to the low-metal signatures enclosed by R1 while the
baseline CHMM assigns relatively lower confidences. Moreover, the eCHMM assigns relatively low
confidences (< .8) compared to the CHMM (> 6) to the clutter signatures of region R2.


Figure 72. Scatter plot of the confidences assigned by the EHD-based eCHMM (abscissa axis) and
the baseline CHMM (ordinate axis) to D3Tst signatures. Clutter, low metal (LM), and high metal
(HM) signatures at different depths are shown with different symbols and colors.


The results for all three datasets are summarized in terms of the Area Under the ROC Curve (AUC)
and are reported in table 26. In most of our experiments (24 settings of continuous vs. discrete, using
EHD, Gabor, gradient, or bar features, for each of the three datasets), the eHMM outperforms the
baseline HMM.
TABLE 26

AUC of the eHMM and baseline HMM classifiers

Dataset   Classifier   EHD   Gabor   Gradient   Bar
D1        eDHMM        343   296     191        190
          bDHMM        272   122     299        367
          eCHMM        326   226     119        180
          bCHMM        284   140     208        173
D2        eDHMM        402   127     498        50
          bDHMM        107   30      400        314
          eCHMM        359   122     497        21
          bCHMM        209   102     391        38
D3        eDHMM        712   719     698        483
          bDHMM        643   499     582        471
          eCHMM        718   617     697        284
          bCHMM        614   472     579        150

5.7  Chapter  summary 

In this chapter, the proposed eHMM classifier is applied and evaluated on a real-world
application: landmine detection. In this application, the eHMM intermediate steps are analyzed
and the final performance results are presented. The experiments show that the proposed eHMM
intermediate results are in line with the expected behavior and that, overall, the ensemble HMM
outperforms the standard HMM. As reported earlier in figures 13, 14, and 15, the clustering step
is able to identify similar groups of signatures. The fusion step results for the ANN and the HME
methods were also presented and analyzed. The results show that both the HME and ANN fusion
methods outperform the individual models and that the HME fusion method gives better results than
the ANN-based fusion. Finally, the eHMM versus the baseline HMM ROCs and scatter plots show
that the eHMM with both ANN and HME fusion significantly outperforms the baseline two-model
classifier.
CHAPTER 6


CONCLUSIONS  AND  POTENTIAL  FUTURE  WORK 

6.1  Conclusions 

In  this  work,  we  have  proposed  a  novel  ensemble  HMM  classification  method  that  is  based 
on  clustering  sequences  in  the  log-likelihood  space.  The  eHMM  uses  multiple  HMM  models  and 
fuses  them  for  the  final  decision  making. 

We  hypothesized  that  the  data  are  generated  by  multiple  models.  These  different  models 
reflect  the  fact  that  the  different  classes  may  have  different  characteristics  accounting  for  intra-class 
variability  within  the  data.  Our  approach  has  four  main  steps.  First,  one  HMM  is  fit  to  each 
individual  sequence.  For  each  fitted  model,  we  evaluate  the  log-likelihood  of  each  sequence.  This 
results  in  a  log-likelihood-based  similarity  matrix.  Next,  the  similarity  matrix  is  partitioned  into 
K  groups  using  a  relational  clustering  algorithm.  Three  clustering  algorithms  were  evaluated:  the 
hierarchical  agglomerative  algorithm,  the  FLeCK  algorithm,  and  the  FCM-MK  algorithm. 
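
The first two steps can be sketched as follows. This is a minimal illustration, not our implementation: it assumes the log-likelihoods have already been computed, with loglik[i][j] denoting the log-likelihood of sequence j under the model fitted to sequence i. The symmetrized distance and the average-linkage rule are assumptions made for this example, and the function names and the sample matrix are hypothetical.

```python
def loglik_distance(loglik):
    """Turn a log-likelihood matrix into a symmetric, non-negative distance matrix.

    Assumed form: d(i, j) = 0.5 * ((L[i][i] - L[i][j]) + (L[j][j] - L[j][i])).
    """
    n = len(loglik)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 0.5 * ((loglik[i][i] - loglik[i][j]) + (loglik[j][j] - loglik[j][i]))
            dist[i][j] = dist[j][i] = max(d, 0.0)
    return dist


def agglomerative(dist, k):
    """Average-linkage agglomerative clustering of n items into k groups."""
    n = len(dist)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of sequences")
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        best = None
        # Find the pair of clusters with the smallest average pairwise distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Made-up log-likelihoods for four sequences (two obvious groups).
L = [[-10, -12, -40, -42],
     [-11, -9, -41, -39],
     [-38, -40, -8, -10],
     [-41, -39, -11, -10]]
groups = agglomerative(loglik_distance(L), k=2)
```

The exhaustive pair search is kept for readability; it is far too slow for large collections, where an optimized relational clustering routine would be used instead.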

In the third step, we initialize the parameters of one HMM per cluster using the observation
sequences belonging to that cluster. Then, we learn the parameters of the HMMs using an
optimized training approach for the different K groups depending on their size and homogeneity.
For clusters that are dominated by sequences from only one class, we use the standard Baum-Welch
re-estimation procedure. For clusters with a mixture of observations belonging to different classes,
we use discriminative training based on minimizing the misclassification error to learn a model for
each class. Finally, for clusters containing a small number of sequences, we use a variational Bayesian
method to update the model parameters given the observed data.
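
The choice of a training scheme per cluster can be summarized by a simple decision rule. The sketch below is illustrative only: the thresholds min_size and purity_threshold, the order in which the two criteria are checked, and the function name choose_training are assumptions, not values or names taken from our experiments.

```python
from collections import Counter


def choose_training(labels, min_size=10, purity_threshold=0.9):
    """Pick a training scheme for one cluster from its size and class purity.

    labels: class labels of the sequences assigned to the cluster.
    min_size, purity_threshold: hypothetical cut-offs, chosen for illustration.
    """
    if not labels:
        raise ValueError("cluster is empty")
    # Small clusters: too little data for reliable maximum-likelihood estimates.
    if len(labels) < min_size:
        return "variational_bayes"
    # Purity: fraction of the cluster that belongs to its majority class.
    purity = Counter(labels).most_common(1)[0][1] / len(labels)
    if purity >= purity_threshold:
        return "baum_welch"
    # Mixed clusters: one model per class, trained discriminatively.
    return "discriminative"


schemes = [
    choose_training(["mine"] * 3),
    choose_training(["mine"] * 19 + ["clutter"]),
    choose_training(["mine"] * 10 + ["clutter"] * 10),
]
```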

To test a new sequence, its likelihood is computed in all the models and a final confidence
value is assigned by combining the multiple models' outputs using a decision-level fusion method
such as an artificial neural network or a hierarchical mixture of experts.
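
To make the fusion step concrete, the sketch below computes a confidence with a single-level mixture of experts, which is a deliberate simplification of the hierarchical mixture of experts. It assumes logistic experts and a softmax gate that both take the vector of per-model outputs as input. All names and weights are hypothetical; in practice the weights would be learned from the training data.

```python
import math


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def moe_confidence(x, gate_w, expert_w):
    """Fuse the per-model outputs x into one confidence in [0, 1].

    x: outputs of the cluster models for one test sequence.
    gate_w[e], expert_w[e]: weight vectors of expert e; the last entry is a bias.
    """
    def affine(w):
        return sum(wi * xi for wi, xi in zip(w, x)) + w[-1]

    gates = softmax([affine(w) for w in gate_w])      # how much each expert is trusted
    experts = [sigmoid(affine(w)) for w in expert_w]  # each expert's confidence
    return sum(g * y for g, y in zip(gates, experts))


# Hypothetical weights for two experts and two cluster-model outputs.
conf = moe_confidence([1.0, 2.0],
                      gate_w=[[0.5, -0.5, 0.0], [-0.5, 0.5, 0.0]],
                      expert_w=[[1.0, 0.0, -0.5], [0.0, 1.0, -1.0]])
```

Because the gate weights sum to one and each expert output lies between 0 and 1, the fused confidence also lies between 0 and 1.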

The eHMM, in its discrete and continuous versions, was evaluated on two real-world
applications. First, we applied our approach to identify CPR scenes in video simulating medical crises.
In this application, we used a small dataset of 2-dimensional CPR and non-CPR sequences that
represent the motion vectors extracted from the simulation videos. Therefore, we were able to illustrate
and visualize the intermediate steps of the eHMM and compare its performance to the baseline HMM. In
the second application, we evaluated the eHMM on the task of landmine detection using GPR data
and multiple feature representations. In particular, we applied our method to three large datasets
of GPR data using EHD, Gabor, gradient, and bar features. In this application, we evaluated the
individual components of the eHMM and emphasized its overall performance compared to the
baseline HMM.

Results on the CPR data and on the large GPR data collections show that the proposed
method can identify meaningful and coherent HMM cluster models that describe different properties
of the data. Each HMM mixture component models a group of data that share common attributes.
The experiments show that the proposed eHMM intermediate results are in line with the expected
behavior. The results indicate that, in both the continuous and discrete versions, the proposed
method outperforms the baseline HMM that uses one model for each class in the data.

6.2  Potential  future  work 

In the application of the eHMM to the GPR data, we used each feature extraction method separately
to build the eHMM mixture models. Therefore, multiple models can be learned from the different
features. We suggest combining the single-feature-based eHMMs either at the similarity
matrix level or at the fusion level. At the similarity matrix level, we can combine the similarity
matrices of the single-feature-based eHMMs and produce a weighted-average similarity matrix that
will be used in the relational clustering. At the fusion level, we can adjust the ANN or HME
architecture to take the mixture models of all the single-feature-based eHMMs as input. Subsequently, the
fusion model parameters will be learned from the training data. Finally, we can invoke a relational
clustering method that simultaneously clusters the training data using several similarity matrices
and assigns a weight to each similarity matrix in each cluster. Those weights can be used at the
decision-level fusion to decide which single-feature-based model will be used for each cluster.
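
The combination at the similarity matrix level could look like the sketch below. It is only an illustration of the weighted-average idea: the function name combine_similarities, the choice of one global weight per feature, and the sample matrices are assumptions. How the weights would be chosen is left open.

```python
def combine_similarities(matrices, weights):
    """Weighted average of several n-by-n similarity matrices.

    matrices: one similarity matrix per feature type (e.g. EHD, Gabor).
    weights: one non-negative weight per matrix; they are normalized to sum to 1.
    """
    if not matrices or len(matrices) != len(weights):
        raise ValueError("need one weight per similarity matrix")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    n = len(matrices[0])
    combined = [[0.0] * n for _ in range(n)]
    for m, w in zip(matrices, weights):
        if len(m) != n or any(len(row) != n for row in m):
            raise ValueError("all matrices must be square and of the same size")
        for i in range(n):
            for j in range(n):
                combined[i][j] += (w / total) * m[i][j]
    return combined


# Two made-up 2-by-2 similarity matrices, weighted 1:3.
s_a = [[0.0, 2.0], [2.0, 0.0]]
s_b = [[0.0, 4.0], [4.0, 0.0]]
s_combined = combine_similarities([s_a, s_b], [1.0, 3.0])
```

The combined matrix can then be passed to the same relational clustering step that is used for a single feature.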

Future  work  also  includes  the  evaluation  of  the  eHMM  with  other  applications  such  as  face 
recognition,  gene  sequence  alignment,  and  handwritten  character  recognition. 